VAST is an experimental compiler pipeline designed for program analysis of C and C++. It provides a tower of IRs as MLIR dialects to choose the best fit representations for a program analysis or further program abstraction.

VAST: MLIR for Program Analysis

VAST is a library for program analysis and instrumentation of C/C++ and related languages. VAST provides a foundation for customizable program representation for a broad spectrum of analyses. Using the MLIR infrastructure, VAST provides a toolset to represent C/C++ programs at various stages of the compilation and to transform the representation to the best-fit program abstraction.

Whether static or dynamic, program analysis often requires a specific view of the source code. A representation is usually required to be easily analyzable (i.e., to have a reasonably small set of operations), to be truthful to the semantics of the analyzed program, and to keep the analysis relatable to the source. It is also beneficial to access the source at various abstraction levels.

The current state-of-the-art tools leverage compiler infrastructures to perform program analysis. This approach is beneficial because the representation, whether AST or LLVM IR, remains truthful to the semantics of the executed program. However, these representations come at a cost, as they are designed for optimization and code generation rather than for program analysis.

The Clang AST is unoptimized and too complex for interpretation-based analysis. Also, it lacks program features that Clang inserts during its LLVM code generation process. On the other hand, LLVM is often too low-level and hard to relate to high-level program constructs.

VAST is a new compiler front/middle-end designed for program analysis. It transforms parsed C and C++ code, in the form of Clang ASTs, into a high-level MLIR dialect. The high-level dialect is then progressively lowered all the way down to LLVM IR. This progression enables VAST to represent the code as a tower of IRs in multiple MLIR dialects. MLIR allows us to capture high-level features from the AST and interleave them with low-level dialects.

A Tower of IRs

The feature that differentiates our approach is that the program representation can hold multiple representations simultaneously, the so-called tower of IRs. One can imagine the tower as multiple MLIR modules side-by-side in various dialects. Each layer of the tower represents a specific stage of compilation. At the top is a high-level dialect relatable to AST, and at the bottom is a low-level LLVM-like dialect. Layers are interlinked with location information. Higher layers can also be seen as metadata for lower layers.

This feature simplifies analyses built on top of VAST IR in multiple ways. It naturally provides provenance from the low-level dialects to the higher-level dialects (and to the source code). Similarly, one can reach the low-level representation from the high-level source view. This has several uses. One of them is relating analysis results to the source. For a user, it is invaluable to see results in the language of what they see, that is, the high-level representation of the source. For example, using provenance, one can link the values in low-level registers to variable names in the source. Furthermore, this streamlines communication between the user and the analysis backend in interactive tools and also allows an automatic analysis to query the best-fit representation at any time.

The provenance is invaluable for static analysis too. It is often advantageous to perform analysis as an abstract interpretation of the low-level representation and relate it to high-level constructs. For example, when trying to infer properties about control flow, like loop invariants, one can examine high-level operations and relate the results to low-level analysis using provenance links.
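As a rough mental model of these provenance links, the sketch below stores each layer as a list of operations, where every operation records the index of the operation it originated from one layer up. All names here (`Op`, `Layer`, `Tower`, `origin`, `example_tower`) are hypothetical and exist only for illustration. They are not VAST APIs, and VAST itself links layers through MLIR location information rather than plain indices.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model for illustration only; none of these names are VAST APIs.
struct Op {
    std::string text;  // free-form label for the operation
    int parent;        // index of the originating op one layer up, -1 if none
};

using Layer = std::vector<Op>;

// Layers are ordered from the highest-level one (index 0) downwards.
struct Tower {
    std::vector<Layer> layers;

    // Follow provenance links from op `op` in layer `layer` up to layer 0.
    // Returns the index of the originating top-layer op, or -1 if a link
    // is missing along the way.
    int origin(std::size_t layer, int op) const {
        while (layer > 0 && op >= 0) {
            op = layers[layer][static_cast<std::size_t>(op)].parent;
            --layer;
        }
        return op;
    }
};

// Two-layer toy tower for the declarations `int x` and `int y`.
inline Tower example_tower() {
    Layer high = {Op{"variable x", -1}, Op{"variable y", -1}};
    Layer low = {Op{"stack slot for x", 0}, Op{"stack slot for y", 1},
                 Op{"store into y", 1}};
    Tower t;
    t.layers = {high, low};
    return t;
}
```

With this toy tower, `example_tower().origin(1, 2)` walks from the low-level "store into y" entry back to index 1 of the top layer, which is the entry for `y`. An analysis that works on the lower layer could report its findings in terms of the source variable in the same way.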

We expect to provide a DSL library for designing custom program representation abstractions on top of our tower of IRs. The library will provide utilities to link other dialects to the rest of the tower so that the provenance is usable outside the main pipeline.

Dialects

As a foundation, VAST provides backbone dialects for the tower of IRs. The high-level dialect hl is a faithful representation of the Clang AST, while intermediate dialects represent compilation artifacts like ABI lowering of macro expansions. Whenever possible, we try to utilize standard dialects. At the bottom of the tower, we have the llvm dialect. For features that are not present in the llvm dialect, we utilize our low-level dialect ll. We leverage a meta dialect to provide provenance utilities. The currently supported features are documented in automatically generated dialect docs.

For types, we provide high-level types from Clang AST enriched by value categories. This allows referencing types as presented in the source. In the rest of the tower, we utilize standard or llvm types, respectively.

One does not need to utilize the whole tower of IRs but can craft a specific representation that interleaves multiple abstractions simultaneously. The pure high-level representation of simple C programs looks as follows:

C:

int main() {
    int x = 0;
    int y = x;
    int *z = &x;
}

High-level dialect:

hl.func external @main() -> !hl.int {
    %0 = hl.var "x" : !hl.lvalue<!hl.int> = {
      %4 = hl.const #hl.integer<0> : !hl.int
      hl.value.yield %4 : !hl.int
    }
    %1 = hl.var "y" : !hl.lvalue<!hl.int> = {
      %4 = hl.ref %0 : !hl.lvalue<!hl.int>
      %5 = hl.implicit_cast %4 LValueToRValue : !hl.lvalue<!hl.int> -> !hl.int
      hl.value.yield %5 : !hl.int
    }
    %2 = hl.var "z" : !hl.lvalue<!hl.ptr<!hl.int>> = {
      %4 = hl.ref %0 : !hl.lvalue<!hl.int>
      %5 = hl.addressof %4 : !hl.lvalue<!hl.int> -> !hl.ptr<!hl.int>
      hl.value.yield %5 : !hl.ptr<!hl.int>
    }
    %3 = hl.const #hl.integer<0> : !hl.int
    hl.return %3 : !hl.int
}

C:

void loop_simple()
{
    for (int i = 0; i < 100; i++) {
        /* ... */
    }
}

High-level dialect:

hl.func external @loop_simple () -> !hl.void {
    %0 = hl.var "i" : !hl.lvalue<!hl.int> = {
      %1 = hl.const #hl.integer<0> : !hl.int
      hl.value.yield %1 : !hl.int
    }
    hl.for {
      %1 = hl.ref %0 : !hl.lvalue<!hl.int>
      %2 = hl.implicit_cast %1 LValueToRValue : !hl.lvalue<!hl.int> -> !hl.int
      %3 = hl.const #hl.integer<100> : !hl.int
      %4 = hl.cmp slt %2, %3 : !hl.int, !hl.int -> !hl.int
      hl.cond.yield %4 : !hl.int
    } incr {
      %1 = hl.ref %0 : !hl.lvalue<!hl.int>
      %2 = hl.post.inc %1 : !hl.lvalue<!hl.int> -> !hl.int
    } do {
    }
    hl.return
}

For example, high-level control flow with standard types:

hl.func external  private @loop_simple() -> none {
    %0 = hl.var "i" : i32 = {
      %1 = hl.const #hl.integer<0> : i32
      hl.value.yield %1 : i32
    }
    hl.for {
      %1 = hl.ref %0 : i32
      %2 = hl.implicit_cast %1 LValueToRValue : i32 -> i32
      %3 = hl.const #hl.integer<100> : i32
      %4 = hl.cmp slt %2, %3 : i32, i32 -> i32
      hl.cond.yield %4 : i32
    } incr {
      %1 = hl.ref %0 : i32
      %2 = hl.post.inc %1 : i32 -> i32
    } do {
    }
    hl.return
}

Types are lowered according to the data layout embedded in the VAST module:

  module attributes {
    hl.data.layout = #dlti.dl_spec<
      #dlti.dl_entry<!hl.void, 0 : i32>,
      #dlti.dl_entry<!hl.int, 32 : i32>,
      #dlti.dl_entry<!hl.ptr<!hl.char>, 64 : i32>,
      #dlti.dl_entry<!hl.char, 8 : i32>,
      #dlti.dl_entry<!hl.bool, 1 : i32>
    >
  }
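To make the role of the data layout concrete, here is a minimal sketch of a width lookup that produces the standard type spellings used in the lowered example above (`!hl.int` becomes `i32`, `!hl.void` becomes `none`). The names `scalar_widths` and `lower_scalar` are hypothetical helpers, not VAST APIs. The `i8` and `i1` spellings for `!hl.char` and `!hl.bool` are an assumption extrapolated from the listed bit widths, and pointer lowering is deliberately not modeled.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Bit widths of the scalar entries in the hl.data.layout attribute above.
// The pointer entry is left out because this sketch does not model pointers.
inline const std::map<std::string, unsigned> &scalar_widths() {
    static const std::map<std::string, unsigned> widths = {
        {"!hl.void", 0}, {"!hl.int", 32}, {"!hl.char", 8}, {"!hl.bool", 1},
    };
    return widths;
}

// Hypothetical helper, not a VAST API: spell the standard type for a scalar
// high-level type, following the README examples (!hl.int -> i32,
// !hl.void -> none). Throws if the type has no entry in the table.
inline std::string lower_scalar(const std::string &hl_type) {
    const auto it = scalar_widths().find(hl_type);
    if (it == scalar_widths().end())
        throw std::out_of_range("no data layout entry for " + hl_type);
    if (it->second == 0)
        return "none";
    return "i" + std::to_string(it->second);
}
```

Changing a width in the table changes the spelling returned for that type, which is the sense in which the lowering is driven by the data layout rather than hard-coded.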

Build

Dependencies

Currently, it is necessary to use clang-16 (due to a gcc bug) to build VAST. On Linux, it is also necessary to use lld at the moment.

VAST uses llvm-16, which can be obtained from the repository provided by LLVM.

Before building on Ubuntu, install all the necessary dependencies by running

apt-get install build-essential cmake ninja-build libstdc++-12-dev llvm-16 libmlir-16 libmlir-16-dev mlir-16-tools libclang-16-dev

or an equivalent command for your operating system of choice.

Instructions

To configure the project, run cmake with the following default options. If clang isn't your default compiler, prefix the command with CC=clang CXX=clang++. If you want to use the system-installed llvm and mlir (on Ubuntu), use:

cmake --preset ninja-multi-default \
    --toolchain ./cmake/lld.toolchain.cmake \
    -DCMAKE_PREFIX_PATH=/usr/lib/llvm-16/

To use a specific llvm, provide the -DCMAKE_PREFIX_PATH=<llvm & mlir installation paths> option, where CMAKE_PREFIX_PATH points to the directory containing LLVMConfig.cmake and MLIRConfig.cmake.

Note: VAST requires LLVM with RTTI enabled. Use LLVM_ENABLE_RTTI=ON if you build your own LLVM.

Finally, build the project:

cmake --build --preset ninja-rel

Use the ninja-deb preset for a debug build.

Run

To run MLIR codegen of the high-level dialect, use:

./builds/ninja-multi-default/bin/vast-cc --from-source <input.c>

Test

ctest --preset ninja-deb

License

VAST is licensed according to the Apache 2.0 license. VAST links against and uses Clang and LLVM APIs. Clang is also licensed under Apache 2.0, with LLVM exceptions.

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Distribution Statement A – Approved for Public Release, Distribution Unlimited
