• Stars
    star
    514
  • Rank 86,040 (Top 2 %)
  • Language
    C++
  • License
    Apache License 2.0
  • Created about 5 years ago
  • Updated 3 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

An LLVM-based instrumentation tool for universal taint tracking, dataflow analysis, and tracing.

PolyTracker


PyPI version Tests Slack Status

PolyTracker is a tool originally created for the Automated Lexical Annotation and Navigation of Parsers, a backronym devised solely for the purpose of referring to it as The ALAN Parsers Project. However, it has evolved into a general purpose tool for efficiently performing data-flow and control-flow analysis of programs. PolyTracker is an LLVM pass that instruments programs to track which bytes of an input file are operated on by which functions. It outputs a database containing the data-flow information, as well as a runtime trace. PolyTracker also provides a Python library for interacting with and analyzing its output, as well as an interactive Python REPL.

PolyTracker can be used in conjunction with PolyFile to automatically determine the semantic purpose of the functions in a parser. It also has an experimental feature capable of generating a context free grammar representing the language accepted by a parser.

Unlike dynamic instrumentation alternatives like Taintgrind, PolyTracker imposes negligible performance overhead for almost all inputs, and is capable of tracking every byte of input at once. PolyTracker started as a fork of the LLVM DataFlowSanitizer and takes much inspiration from the Angora Fuzzer. However, unlike the Angora system, PolyTracker is able to track the entire provenance of a taint. In February of 2021, the LLVM DataFlowSanitizer added a new feature for tracking taint provenance called origin tracking. However, it is only able to track at most 16 taints at once, while PolyTracker can track up to 231-1.

This README serves as the general usage guide for installing PolyTracker and compiling/instrumenting binaries. For programmatically interacting with or extending PolyTracker through its Python API, as well as for interacting with runtime traces produced from instrumented code, consult the Python documentation.

Quickstart

PolyTracker is controlled via a Python script called polytracker. You can install it by running

pip3 install polytracker

PolyTracker requires a very particular system environment to run, so almost all users are likely to run it in a containerized environment. Luckily, polytracker makes this easy. All you need to do is have docker installed, then run:

polytracker docker pull

and

polytracker docker run

The latter command will mount the current working directory into the PolyTracker Docker container, and allow you to build and run instrumented programs.

The polytracker control script—which you can run from either your host system or from inside the Docker container—has a variety of commands, both for instrumenting programs as well as analyzing the resulting artifacts. For example, you can explore the dataflows in the execution, reconstruct the instrumented program's control flow graph, and even extract a context free grammar matching the inputs accepted by the program. You can explore these commands by running

polytracker --help

The polytracker script is also a REPL, if run with no command line arguments:

$ polytracker
PolyTracker (4.0.0)
https://github.com/trailofbits/polytracker
Type "help" or "commands"
>>> commands

Instrumenting a simple C/C++ program

PolyTracker also comes with a build command. This command allows the user to run any build command in a Blight instrumented environment. This will produce a blight_journal.jsonl file that records all commands run during the build. If you have a C/C++ target, you can instrument it by invoking polytracker build and passing your build command:

polytracker build gcc -g -o my_binary my_source.c

To instrument a build target, use the instrument-targets command. By default the command will use the a blight_journal.jsonl in your current working directory to build an instrumented version of your build target. The instrumented build target will be built using the same flags as the original build target.

polytracker instrument-targets my_binary

build also supports more complex programs that use a build system like autotiools or CMake:

polytracker build cmake .. -DCMAKE_BUILD_TYPE=Release
polytracker build ninja
# or
polytracker build ./configure
polytracker build make

Then run instrument-targets on any targets of the build:

$ polytracker instrument-targets a.bin b.so

Then a.instrumented.bin and b.instrumented.so will be the instrumented versions. See the Dockerfiles in the examples directory for examples of how real-world programs can be instrumented.

Running and Analyzing an Instrumented Program

The instrumented software will write its output to the path specified in POLYDB, or polytracker.tdag if omitted. This is a binary file that can be operated on by running:

from polytracker import PolyTrackerTrace, taint_dag

trace = PolyTrackerTrace.load("polytracker.tdag")
tdfile = trace.tdfile

first_node = list(tdfile.nodes)[0]
print(f"First node affects control flow: {first_node.affects_control_flow}")

# Operate on all Range nodes
for index, node in enumerate(tdfile.nodes):
  if isinstance(node, taint_dag.TDRangeNode):
    print(f"Node {index}: first {node.first}, last {node.last}")

# Access taint forest
tdforest = trace.taint_forest
n1 = tdforest.get_node(1)
print(
  f"Forest node {n1.label}. Parent labels: {n1.parent_labels}, "
  f"source: {n1.source.path if n1.source is not None else None}, "
  f"affects control flow: {n1.affected_control_flow}"
)

You can also run an instrumented binary directly from the REPL:

$ polytracker
PolyTracker (4.0.0)
https://github.com/trailofbits/polytracker
Type "help" or "commands"
>>> trace = run_trace("path_to_binary", "path_to_input_file")

This will automatically run the instrumented binary in a Docker container, if necessary.

⚠️ If running PolyTracker inside Docker or a VM: PolyTracker can be very slow if running in a virtualized environment and either the input file or, especially, the output database are located in a directory mapped or mounted from the host OS. This is particularly true when running PolyTracker in Docker from a macOS host. The solution is to write the database to a path inside of the container/VM and then copy it out to the host system at the very end.

The Python API documentation is available here.

Runtime Parameters and Instrumentation Tuning

At runtime, PolyTracker instrumentation looks for a number of configuration parameters specified through environment variables. This allows one to modify instrumentation parameters without needing to recompile the binary.

Environment Variables

PolyTracker accepts configuration parameters in the form of environment variables to avoid recompiling target programs. The current environment variables PolyTracker supports is:

POLYDB: A path to which to save the output database (default is polytracker.tdag)

WLLVM_ARTIFACT_STORE: Provides a path to an existing directory to store artifact/manifest for all build targets

POLYTRACKER_TAINT_ARGV: Set to '1' to use argv as a taint source.

Polytracker will set its configuration parameters in the following order:

  1. If a parameter is specified via an environment variable, use that value
  2. Else if a default value for the parameter exists, use the default
  3. Else throw an error

ABI Lists

DFSan uses ABI lists to determine what functions it should automatically instrument, what functions it should ignore, and what custom function wrappers exist. See the dfsan documentation for more information.

Creating custom ignore lists from pre-built libraries

Attempting to build large software projects can be time consuming, especially older/unsupported ones. It's even more time consuming to try and modify the build system such that it supports changes, like dfsan's/our instrumentation.

There is a script located in polytracker/scripts that you can run on any ELF library and it will output a list of functions to ignore. We use this when we do not want to track information going through a specific library like libpng, or other sub components of a program. The Dockerfile-listgen.demo exists to build common open source libraries so we can create these lists.

This script is a slightly tweaked version of what DataFlowSanitizer has, which focuses on ignoring system libraries. The original script can be found in dfsan_rt.

Building the Examples

Check out this Git repository. From the root, either build the base PolyTracker Docker image:

pip3 install -e ".[dev]" && polytracker docker rebuild

or pull the latest prebuilt version from DockerHub:

docker pull trailofbits/polytracker:latest

For a demo of PolyTracker running on the MuPDF parser run this command:

docker build -t trailofbits/polytracker-demo-mupdf -f examples/pdf/Dockerfile-mupdf.demo .

mutool_track will be build in /polytracker/the_klondike/mupdf/build/debug. Running mutool_track will output polytracker.tdag which contains the information provided by the taint analysis.

For a demo of PolyTracker running on Poppler utils version 0.84.0 run this command:

docker build -t trailofbits/polytracker-demo-poppler -f examples/pdf/Dockerfile-poppler.demo .

All the poppler utils will be located in /polytracker/the_klondike/poppler-0.84.0/build/utils.

$ cd /polytracker/the_klondike/poppler-0.84.0/build/utils
$ ./pdfinfo_track some_pdf.pdf

Building PolyTracker from Source

The compilation process for both PolyTracker LLVM and PolyTracker is rather fickle, since it involves juggling both instrumented and non-instrumented versions of standard library bitcode. We highly recommend using our pre-built and tested Docker container if at all possible. Installing the PolyTracker Python package on your host system will allow you to seamlessly interact with the prebuilt Docker container. Otherwise, to install PolyTracker natively, we recommend first replicating the install process from the polytracker-llvm Dockerfile, followed by replicating the install process from the PolyTracker Dockerfile.

Build Dependencies

  • PolyTracker LLVM. PolyTracker is built atop its own fork of LLVM, polytracker-llvm. This fork modifies the DataFlow Sanitizer to use increased label sizes (to allow for tracking orders of magnitude more taints), as well as alternative data structures to store them. We have investigated up-streaming our changes into LLVM proper, but there has been little interest.
  • CMake
  • Ninja (ninja-build on Ubuntu)

Runtime Dependencies

The following tools are required to test and run PolyTracker:

  • Python 3.7+ and pip (apt-get -y install python3.7 python3-pip). These are used for both seamlessly interacting with the Docker container (if necessary), as well as post-processing and analyzing the artifacts produced from runtime traces.
  • gllvm (go get github.com/SRI-CSL/gllvm/cmd/...) is used to create whole program bitcode archives and to extract bitcode from targets.

Building on Apple silicon:

Prebuilt Docker images for polytracker-llvm are only available for amd64. Users with arm64 systems will have to build the image locally and then change polytracker's Dockerfile to point to it:

$ mkdir repos && cd repos
$ git clone https://github.com/trailofbits/polytracker
$ git clone https://github.com/trailofbits/polytracker-llvm
$ cd polytracker-llvm
$ DOCKER_BUILDKIT=1 docker build -t trailofbits/polytracker-llvm .
$ cd ../polytracker
$ ## Replace the first line of the Dockerfile with "FROM trailofbits/polytracker-llvm:latest" (no quotes)
$ docker build -t trailofbits/polytracker .

Current Status and Known Issues

PolyTracker currently only runs on Linux, because that is the only system supported by the DataFlow Santizer. This limitation is just due to a lack of support for semantics for other OSes system calls, which could be added in the future. However, this means that running PolyTracker on a non-Linux system will require Docker to be installed.

Taints will not propagate through dynamically loaded libraries unless those libraries were compiled from source using PolyTracker, or there is specific support for the library calls implemented in PolyTracker. There is currently support for propagating taint through the majority of uninstrumented C standard library calls. To be clear, programs that use uninstrumented functions will still run normally, however, operations performed by unsupported library calls will not propagate taint. We are currently working on adding robust support for C++ programs, but currently the best results will be from C programs.

If there are issues with Docker, try performing a system prune and build with --no-cache for both PolyTracker and whatever demo you are trying to run.

The worst case performance of PolyTracker is exercised when a single byte in memory is simultaneously tainted by a large number of input bytes from the source file. This is most common when instrumenting compression and cryptographic algorithms that have large block sizes. There are a number of mitigations for this behavior currently being researched and developed.

License and Acknowledgements

This research was developed by Trail of Bits with funding from the Defense Advanced Research Projects Agency (DARPA) under the SafeDocs program as a subcontractor to Galois. It is licensed under the Apache 2.0 license. © 2019, Trail of Bits.

Maintainers

Evan Sultanik
Henrik Brodin
Marek Surovič
Facundo Tuesca

[email protected]

More Repositories

1

algo

Set up a personal VPN in the cloud
Jinja
27,779
star
2

manticore

Symbolic execution tool
Python
3,536
star
3

graphtage

A semantic diff utility and library for tree-like files such as JSON, JSON5, XML, HTML, YAML, and CSV.
Python
2,354
star
4

ctf

CTF Field Guide
C
1,273
star
5

publications

Publications from Trail of Bits
Python
1,232
star
6

deepstate

A unit test-like interface for fuzzing and symbolic execution
Python
818
star
7

pe-parse

Principled, lightweight C/C++ PE parser
C++
691
star
8

eth-security-toolbox

A Docker container preconfigured with all of the Trail of Bits Ethereum security tools.
Dockerfile
670
star
9

maat

Open-source symbolic execution framework: https://maat.re
C++
612
star
10

twa

A tiny web auditor with strong opinions.
Shell
579
star
11

winchecksec

Checksec, but for Windows: static detection of security mitigations in executables
C++
523
star
12

cb-multios

DARPA Challenges Sets for Linux, Windows, and macOS
C
498
star
13

multiplier

Code auditing productivity multiplier.
C++
434
star
14

onesixtyone

Fast SNMP Scanner
C
411
star
15

fickling

A Python pickling decompiler and static analyzer
Python
407
star
16

vast

VAST is an experimental compiler pipeline designed for program analysis of C and C++. It provides a tower of IRs as MLIR dialects to choose the best fit representations for a program analysis or further program abstraction.
C++
381
star
17

tubertc

Peer-to-Peer Video Chat for Corporate LANs
JavaScript
361
star
18

krf

A kernelspace syscall interceptor and randomized faulter
C
348
star
19

polyfile

A pure Python cleanroom implementation of libmagic, with instrumented parsing from Kaitai struct and an interactive hex viewer
Python
338
star
20

it-depends

A tool to automatically build a dependency graph and Software Bill of Materials (SBOM) for packages and arbitrary source code repositories.
Python
328
star
21

sinter

A user-mode application authorization system for MacOS written in Swift
Swift
301
star
22

SecureEnclaveCrypto

Demonstration library for using the Secure Enclave on iOS
Swift
276
star
23

protofuzz

Google Protocol Buffers message generator
Python
267
star
24

osquery-extensions

osquery extensions by Trail of Bits
C
262
star
25

dylint

A tool for running Rust lints from dynamic libraries
Rust
259
star
26

RpcInvestigator

Exploring RPC interfaces on Windows
C#
245
star
27

constexpr-everything

Rewrite C++ code to automatically apply `constexpr` where possible
C++
245
star
28

binjascripts

Scripts for Binary Ninja
Python
241
star
29

audit-kubernetes

k8s audit repo
Go
226
star
30

mishegos

A differential fuzzer for x86 decoders
C++
226
star
31

semgrep-rules

Semgrep queries developed by Trail of Bits.
Go
197
star
32

circomspect

A static analyzer and linter for the Circom zero-knowledge DSL
Rust
186
star
33

PrivacyRaven

Privacy Testing for Deep Learning
Python
183
star
34

llvm-sanitizer-tutorial

An LLVM sanitizer tutorial
C++
177
star
35

siderophile

Find the ideal fuzz targets in a Rust codebase
Rust
171
star
36

flying-sandbox-monster

Sandboxed, Rust-based, Windows Defender Client
Rust
170
star
37

not-going-anywhere

A set of vulnerable Golang programs
Go
163
star
38

AppJailLauncher

CTF Challenge Framework for Windows 8 and above
C++
141
star
39

BTIGhidra

Binary Type Inference Ghidra Plugin
Java
138
star
40

uthenticode

A cross-platform library for verifying Authenticode signatures
C++
136
star
41

zkdocs

Interactive documentation on zero-knowledge proof systems and related primitives.
HTML
133
star
42

sienna-locomotive

A user-friendly fuzzing and crash triage tool for Windows
C++
132
star
43

Honeybee

An experimental high performance, fuzzing oriented Intel Processor Trace capture and analysis suite
C
127
star
44

ObjCGraphView

A graph view plugin for Binary Ninja to visualize Objective-C
Python
127
star
45

pasta

Peter's Amazing Syntax Tree Analyzer
C++
124
star
46

sqlite_wrapper

An easy-to-use, extensible and lightweight C++17 wrapper for SQLite
C++
117
star
47

ebpfpub

ebpfpub is a generic function tracing library for Linux that supports tracepoints, kprobes and uprobes.
C++
113
star
48

ctf-challenges

CTF Challenges
Python
112
star
49

binrec-tob

BinRec: Dynamic Binary Lifting and Recompilation
C++
110
star
50

appjaillauncher-rs

AppJailLauncher in Rust
Rust
103
star
51

vscode-weaudit

Create code bookmarks and code highlights with a click.
TypeScript
103
star
52

test-fuzz

To make fuzzing Rust easy
Rust
100
star
53

on-edge

A library for detecting certain improper uses of the "Defer, Panic, and Recover" pattern in Go programs
Go
97
star
54

ios-integrity-validator

Integrity validator for iOS devices
Shell
97
star
55

abi3audit

Scans Python packages for abi3 violations and inconsistencies
Python
97
star
56

ebpfault

A BPF-based syscall fault injector
C++
94
star
57

clang-cfi-showcase

Sample programs that illustrate how to use control flow integrity with the clang compiler
C++
92
star
58

awesome-ml-security

85
star
59

blight

A framework for instrumenting build tools
Python
83
star
60

ruzzy

A coverage-guided fuzzer for pure Ruby code and Ruby C extensions
Ruby
74
star
61

ManticoreUI

The Manticore User Interface with plugins for Binary Ninja and Ghidra
Python
73
star
62

bisc

Borrowed Instructions Synthetic Computation
Ruby
70
star
63

manticore-examples

Example Manticore scripts
Python
69
star
64

algo-ng

Experimental version of Algo built on Terraform
HCL
68
star
65

differ

Detecting Inconsistencies in Feature or Function Evaluations of Requirements
Python
67
star
66

deceptiveidn

Use computer vision to determine if an IDN can be interpreted as something it's not
Python
63
star
67

LeftoverLocalsRelease

The public release of LeftoverLocals code
C++
60
star
68

necessist

A tool for finding bugs in tests
Rust
59
star
69

reverie

An efficient and generalized implementation of the IKOS-style KKW proof system (https://eprint.iacr.org/2018/475) for arbitrary rings.
Rust
59
star
70

Codex-Decompiler

Python
57
star
71

testing-handbook

Trail of Bits Testing Handbook
C++
57
star
72

magnifier

C++
56
star
73

sixtyfour

How fast can we brute force a 64-bit comparison?
C
52
star
74

DomTreSat

Dominator Tree LLVM Pass to Test Satisfiability
C++
47
star
75

HVCI-loldrivers-check

PowerShell
45
star
76

nyc-infosec

Mapping the NYC Infosec Community
CSS
43
star
77

cfg-showcase

Sample programs that illustrate how to use Control Flow Guard, VS2015's control flow integrity implementation
C++
40
star
78

tsc_freq_khz

Linux kernel driver to export the TSC frequency via sysfs
C
40
star
79

rubysec

RubySec Field Guide
Ruby
40
star
80

macroni

C and C++ compiler frontend using PASTA to parse code, and VAST to represent the code as MLIR.
C
39
star
81

indurative

Easily create authenticated data structures
Haskell
37
star
82

http-security

Parse HTTP Security Headers
Ruby
36
star
83

trailofphish

Phishing e-mail repository
Ruby
36
star
84

KRFAnalysis

Collection of LLVM passes and triage tools for use with the KRF fuzzer
LLVM
35
star
85

ebpf-verifier

Harness for the Linux kernel eBPF verifier
C
32
star
86

ml-file-formats

List of ML file formats
31
star
87

umberto

poststructural fuzzing
Haskell
30
star
88

spf-query

Ruby SPF Parser
Ruby
29
star
89

ebpf-common

Various utilities useful for developers writing BPF tools
C++
29
star
90

clang-tidy-audit

Rewrite C/C++/Obj-C to Annotate Points of Interest
C++
27
star
91

eatmynetwork

A small script for running programs with (minimal) network sandboxing
Shell
26
star
92

btfparse

A C++ library that parses debug information encoded in BTF format
C++
25
star
93

anselm

Detect patterns of bad behavior in function calls
C++
25
star
94

dmarc

Ruby DMARC Parser
Ruby
25
star
95

linuxevents

A sample PoC for container-aware exec events for osquery
C++
23
star
96

mpc-learning

Perform multi-party computation on machine learning applications
Python
21
star
97

WinDbg-JS

JavaScript
21
star
98

go-mutexasserts

A small library that allows to check if Go mutexes are locked
Go
21
star
99

screen

Measure branching along code paths
C
20
star
100

itergator

CodeQL library and queries for iterator invalidation
CodeQL
19
star