• Stars
    star
    338
  • Rank 124,931 (Top 3 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 5 years ago
  • Updated 5 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A pure Python cleanroom implementation of libmagic, with instrumented parsing from Kaitai struct and an interactive hex viewer

PolyFile


PyPI version Tests Slack Status

A utility to identify and map the semantic structure of files, including polyglots, chimeras, and schizophrenic files. It can be used in conjunction with its sister tool PolyTracker for Automated Lexical Annotation and Navigation of Parsers, a backronym devised solely for the purpose of collectively referring to the tools as The ALAN Parsers Project.

Quickstart

You can install the latest stable version of PolyFile from PyPI:

pip3 install polyfile

To install PolyFile from source, in the same directory as this README, run:

pip3 install .

Important: Before installing from source, make sure Java is installed. Java is used to run the Kaitai Struct compiler, which compiles the file format definitions.

This will automatically install the polyfile and polymerge executables in your path.

Usage

Running polyfile on a file with no arguments will mimic the behavior of file --keep-going:

$ polyfile png-polyglot.png
PNG image data, 256 x 144, 8-bit/color RGB, non-interlaced
Brainfu** Program
Malformed PDF
PDF document, version 1.3,  1 pages
ZIP end of central directory record Java JAR archive 

To generate an interactive hex viewer for the file, use the --html option:

$ polyfile --html output.html png-polyglot.png
Found a file of type application/pdf at byte offset 0
Found a file of type application/x-brainfuck at byte offset 0
Found a file of type image/png at byte offset 0
Found a file of type application/zip at byte offset 0
Found a file of type application/java-archive at byte offset 0
Saved HTML output to output.html

Full usage instructions follow:

usage: polyfile [-h] [--format {file,mime,html,json,sbud}] [--output OUTPUT]
                [--filetype FILETYPE] [--list] [--html HTML] [--explain]
                [--only-match-mime] [--only-match] [--require-match]
                [--max-matches MAX_MATCHES] [--debugger]
                [--eval-command EVAL_COMMAND] [--no-debug-python]
                [--quiet | --debug | --trace] [--version] [-dumpversion]
                [FILE]

A utility to recursively map the structure of a file.

positional arguments:
  FILE                  the file to analyze; pass '-' or omit to read from STDIN

options:
  -h, --help            show this help message and exit
  --format {file,mime,html,json,sbud}, -r {file,mime,html,json,sbud}
                        PolyFile's output format
                        
                        Output formats are:
                        file ...... the detected formats associated with the file,
                                    like the output of the `file` command
                        mime ...... the detected MIME types associated with the file,
                                    like the output of the `file --mime-type` command
                        explain ... like 'mime', but adds a human-readable explanation
                                    for why each MIME type matched
                        html ...... an interactive HTML-based hex viewer
                        json ...... a modified version of the SBUD format in JSON syntax
                        sbud ...... equivalent to 'json'
                        
                        Multiple formats can be output at once:
                        
                            polyfile INPUT_FILE -f mime -f json
                        
                        Their output will be concatenated to STDOUT in the order that
                        they occur in the arguments.
                        
                        To save each format to a separate file, see the `--output` argument.
                        
                        If no format is specified, PolyFile defaults to `--format file`
  --output OUTPUT, -o OUTPUT
                        an optional output path for `--format`
                        
                        Each instance of `--output` applies to the previous instance
                        of the `--format` option.
                        
                        For example:
                        
                            polyfile INPUT_FILE --format html --output output.html \
                                                --format sbud --output output.json
                        
                        will save HTML to to `output.html` and SBUD to `output.json`.
                        No two outputs can be directed at the same file path.
                        
                        The path can be '-' for STDOUT.
                        If an `--output` is omitted for a format,
                        then it will implicitly be printed to STDOUT.
  --filetype FILETYPE, -f FILETYPE
                        explicitly match against the given filetype or filetype wildcard (default is to match against all filetypes)
  --list, -l            list the supported filetypes for the `--filetype` argument and exit
  --html HTML, -t HTML  path to write an interactive HTML file for exploring the PDF;
                        equivalent to `--format html --output HTML`
  --explain             equivalent to `--format explain
  --only-match-mime, -I
                        "just print out the matching MIME types for the file, one on each line;
                        equivalent to `--format mime`
  --only-match, -m      do not attempt to parse known filetypes; only match against file magic
  --require-match       if no matches are found, exit with code 127
  --max-matches MAX_MATCHES
                        stop scanning after having found this many matches
  --debugger, -db       drop into an interactive debugger for libmagic file definition matching and PolyFile parsing
  --eval-command EVAL_COMMAND, -ex EVAL_COMMAND
                        execute the given debugger command
  --no-debug-python     by default, the `--debugger` option will break on custom matchers and prompt to debug using PDB. This option will suppress those prompts.
  --quiet, -q           suppress all log output
  --debug, -d           print debug information
  --trace, -dd          print extra verbose debug information
  --version, -v         print PolyFile's version information to STDERR
  -dumpversion          print PolyFile's raw version information to STDOUT and exit

Interactive Debugger

PolyFile has an interactive debugger both for its file matching and parsing. It can be used to debug a libmagic pattern definition, determine why a specific file fails to be classified as the expected MIME type, or step through a parser. You can run PolyFile with the debugger enabled using the -db option.

File Support

PolyFile has a cleanroom, pure Python implementation of the libmagic file classifier, and supports all 263 MIME types that it can identify.

It currently has support for parsing and semantically mapping the following formats:

For an example that exercises all of these file formats, run:

curl -v --silent https://www.sultanik.com/files/ESultanikResume.pdf | polyfile --html ESultanikResume.html -

Prior to PolyFile version 0.3.0, it used the TrID database for file identification rather than the libmagic file definitions. This proved to be very slow (since TrID has many duplicate entries) and prone to false positives (since TrID's file definitions are much simpler than libmagic's). The original TrID matching code is still shipped with PolyFile and can be invoked programmatically, but it is not used by default.

Output Format

PolyFile has several options for outputting its results, specified by its --format option. For computer-readable output, PolyFile has an extension of the SBuD JSON format described in the documentation. Prior to version 0.5.0 this was the default output format of PolyFile. However, now the default output format is to mimic the behavior of the file command. To maintain the original behavior, use the --format sbud option.

libMagic Implementation

PolyFile has a cleanroom implementation of libmagic (used in the file command). It can be invoked programmatically by running:

from polyfile.magic import MagicMatcher

with open("file_to_test", "rb") as f:
    # the default instance automatically loads all file definitions
    for match in MagicMatcher.DEFAULT_INSTANCE.match(f.read()):
        for mimetype in match.mimetypes:
            print(f"Matched MIME: {mimetype}")
        print(f"Match string: {match!s}")

To load a specific or custom file definition:

list_of_paths_to_definitions = ["def1", "def2"]
matcher = MagicMatcher.parse(*list_of_paths_to_definitions)
with open("file_to_test", "rb") as f:
    for match in matcher.match(f.read()):
        ...

Debugging the libmagic DSL

libmagic has an esoteric, poorly documented domain-specific language (DSL) for specifying its matching signatures. You can read the minimal andβ€”as we have discovered in our cleanroom implementationβ€”incomplete documentation by running man 5 magic. PolyFile implements an interactive debugger for stepping through the DSL specifications, modeled after GDB. You can enter this debugger by passing the --debugger or -db argument to PolyFile. It is useful for both implementing new libmagic DSLs, as well as figuring out why an existing DSL fails to match against a given file.

$ polyfile -db input_file
PolyFile 0.3.5
Copyright Β©2021 Trail of Bits
Apache License Version 2.0 https://www.apache.org/licenses/

For help, type "help".
(polyfile) help
help ....... print this message
continue ... continue execution until the next breakpoint is hit
step ....... step through a single magic test
next ....... continue execution until the next test that matches
where ...... print the context of the current magic test (aliases: info stack and backtrace)
test ....... test the following libmagic DSL test at the current position
print ...... print the computed absolute offset of the following libmagic DSL offset
breakpoint . list the current breakpoints or add a new one
delete ..... delete a breakpoint
quit ....... exit the debugger

Merging Output From PolyTracker

PolyTracker is PolyFile’s sister utility for automatically instrumenting a parser to track the input byte offsets operated on by each function. The output of both tools can be merged to automatically label the semantic purpose of the functions in a parser. For example, given an instrumented black-box binary, we can quickly determine which functions in the program are responsible for parsing which parts of the input file format’s grammar. This is an area of active research intended to achieve fully automated grammar extraction from a parser.

A separate utility called polymerge is installed with PolyFile specifically designed to merge the output of both tools.

usage: polymerge [-h] [--cfg CFG] [--cfg-pdf CFG_PDF]
                 [--dataflow [DATAFLOW ...]] [--no-intermediate-functions]
                 [--demangle] [--type-hierarchy TYPE_HIERARCHY]
                 [--type-hierarchy-pdf TYPE_HIERARCHY_PDF] [--diff [DIFF ...]]
                 [--debug] [--quiet] [--version] [-dumpversion]
                 FILES [FILES ...]

A utility to merge the JSON output of `polyfile`
with a polytracker.json file from PolyTracker.

https://github.com/trailofbits/polyfile/
https://github.com/trailofbits/polytracker/

positional arguments:
  FILES                 Path to the PolyFile JSON output and/or the PolyTracker JSON output. Merging will only occur if both files are provided. The `--cfg` and `--type-hierarchy` options can be used if only a single file is provided, but no merging will occur.

optional arguments:
  -h, --help            show this help message and exit
  --cfg CFG, -c CFG     Optional path to output a Graphviz .dot file representing the control flow graph of the program trace
  --cfg-pdf CFG_PDF, -p CFG_PDF
                        Similar to --cfg, but renders the graph to a PDF instead of outputting the .dot source
  --dataflow [DATAFLOW ...]
                        For the CFG generation options, only render functions that participated in dataflow. `--dataflow 10` means that only functions in the dataflow related to byte 10 should be included. `--dataflow 10:30` means that only functions operating on bytes 10 through 29 should be included. The beginning or end of a range can be omitted and will default to the beginning and end of the file, respectively. Multiple `--dataflow` ranges can be specified. `--dataflow :` will render the CFG only with functions that operated on tainted bytes. Omitting `--dataflow` will produce a CFG containing all functions.
  --no-intermediate-functions
                        To be used in conjunction with `--dataflow`. If enabled, only functions in the dataflow graph if they operated on the tainted bytes. This can result in a disjoint dataflow graph.
  --demangle            Demangle C++ function names in the CFG (requires that PolyFile was installed with the `demangle` option, or that the `cxxfilt` Python module is installed.)
  --type-hierarchy TYPE_HIERARCHY, -t TYPE_HIERARCHY
                        Optional path to output a Graphviz .dot file representing the type hierarchy extracted from PolyFile
  --type-hierarchy-pdf TYPE_HIERARCHY_PDF, -y TYPE_HIERARCHY_PDF
                        Similar to --type-hierarchy, but renders the graph to a PDF instead of outputting the .dot source
  --diff [DIFF ...]     Diff an arbitrary number of input polytracker.json files, all treated as the same class, against one or more polytracker.json provided after `--diff` arguments
  --debug, -d           Print debug information
  --quiet, -q           Suppress all log output (overrides --debug)
  --version, -v         Print PolyMerge's version information and exit
  -dumpversion          Print PolyMerge's raw version information and exit

The output of polymerge is the same as PolyFile’s output format, augmented with the following:

  1. For each semantic label in the hierarchy, a list of…
    1. …functions that operated on bytes tainted with that label; and
    2. …functions whose control flow was influenced by bytes tainted with that label.
  2. For each type within the semantic hierarchy, a list of functions that are β€œmost specialized” in processing that type. This process is described in the next section.

polymerge can also optionally emit a Graphviz .dot file or rendered PDF of the runtime control-flow graph recorded by PolyTracker.

Identifying Function Specializations

As mentioned above, polymerge attempts to match each semantic type of the input file to a set of functions that are β€œmost specialized” in operating on that type. This is an active area of academic research and is likely to change in the future, but here is the current method employed by polymerge:

  1. For each semantic type in the input file, collect the functions that operated on bytes from that type;
  2. For each function, calculate the Shannon entropy of the different types on which that function operated;
  3. Sort the functions by entropy, and select the functions in the smallest standard deviation; and
  4. Keep the functions that are shallowest in the dominator tree of the runtime control-flow graph.

License and Acknowledgements

This research was developed by Trail of Bits with funding from the Defense Advanced Research Projects Agency (DARPA) under the SafeDocs program as a subcontractor to Galois. It is licensed under the Apache 2.0 license. Β© 2019, Trail of Bits.

More Repositories

1

algo

Set up a personal VPN in the cloud
Jinja
27,779
star
2

manticore

Symbolic execution tool
Python
3,536
star
3

graphtage

A semantic diff utility and library for tree-like files such as JSON, JSON5, XML, HTML, YAML, and CSV.
Python
2,354
star
4

ctf

CTF Field Guide
C
1,273
star
5

publications

Publications from Trail of Bits
Python
1,232
star
6

deepstate

A unit test-like interface for fuzzing and symbolic execution
Python
818
star
7

pe-parse

Principled, lightweight C/C++ PE parser
C++
691
star
8

eth-security-toolbox

A Docker container preconfigured with all of the Trail of Bits Ethereum security tools.
Dockerfile
670
star
9

maat

Open-source symbolic execution framework: https://maat.re
C++
612
star
10

twa

A tiny web auditor with strong opinions.
Shell
579
star
11

winchecksec

Checksec, but for Windows: static detection of security mitigations in executables
C++
523
star
12

polytracker

An LLVM-based instrumentation tool for universal taint tracking, dataflow analysis, and tracing.
C++
514
star
13

cb-multios

DARPA Challenges Sets for Linux, Windows, and macOS
C
498
star
14

multiplier

Code auditing productivity multiplier.
C++
434
star
15

onesixtyone

Fast SNMP Scanner
C
411
star
16

fickling

A Python pickling decompiler and static analyzer
Python
407
star
17

vast

VAST is an experimental compiler pipeline designed for program analysis of C and C++. It provides a tower of IRs as MLIR dialects to choose the best fit representations for a program analysis or further program abstraction.
C++
381
star
18

tubertc

Peer-to-Peer Video Chat for Corporate LANs
JavaScript
361
star
19

krf

A kernelspace syscall interceptor and randomized faulter
C
348
star
20

it-depends

A tool to automatically build a dependency graph and Software Bill of Materials (SBOM) for packages and arbitrary source code repositories.
Python
328
star
21

sinter

A user-mode application authorization system for MacOS written in Swift
Swift
301
star
22

SecureEnclaveCrypto

Demonstration library for using the Secure Enclave on iOS
Swift
276
star
23

protofuzz

Google Protocol Buffers message generator
Python
267
star
24

osquery-extensions

osquery extensions by Trail of Bits
C
262
star
25

dylint

A tool for running Rust lints from dynamic libraries
Rust
259
star
26

RpcInvestigator

Exploring RPC interfaces on Windows
C#
245
star
27

constexpr-everything

Rewrite C++ code to automatically apply `constexpr` where possible
C++
245
star
28

binjascripts

Scripts for Binary Ninja
Python
241
star
29

audit-kubernetes

k8s audit repo
Go
226
star
30

mishegos

A differential fuzzer for x86 decoders
C++
226
star
31

semgrep-rules

Semgrep queries developed by Trail of Bits.
Go
197
star
32

circomspect

A static analyzer and linter for the Circom zero-knowledge DSL
Rust
186
star
33

PrivacyRaven

Privacy Testing for Deep Learning
Python
183
star
34

llvm-sanitizer-tutorial

An LLVM sanitizer tutorial
C++
177
star
35

siderophile

Find the ideal fuzz targets in a Rust codebase
Rust
171
star
36

flying-sandbox-monster

Sandboxed, Rust-based, Windows Defender Client
Rust
170
star
37

not-going-anywhere

A set of vulnerable Golang programs
Go
163
star
38

AppJailLauncher

CTF Challenge Framework for Windows 8 and above
C++
141
star
39

BTIGhidra

Binary Type Inference Ghidra Plugin
Java
138
star
40

uthenticode

A cross-platform library for verifying Authenticode signatures
C++
136
star
41

zkdocs

Interactive documentation on zero-knowledge proof systems and related primitives.
HTML
133
star
42

sienna-locomotive

A user-friendly fuzzing and crash triage tool for Windows
C++
132
star
43

Honeybee

An experimental high performance, fuzzing oriented Intel Processor Trace capture and analysis suite
C
127
star
44

ObjCGraphView

A graph view plugin for Binary Ninja to visualize Objective-C
Python
127
star
45

pasta

Peter's Amazing Syntax Tree Analyzer
C++
124
star
46

sqlite_wrapper

An easy-to-use, extensible and lightweight C++17 wrapper for SQLite
C++
117
star
47

ebpfpub

ebpfpub is a generic function tracing library for Linux that supports tracepoints, kprobes and uprobes.
C++
113
star
48

ctf-challenges

CTF Challenges
Python
112
star
49

binrec-tob

BinRec: Dynamic Binary Lifting and Recompilation
C++
110
star
50

appjaillauncher-rs

AppJailLauncher in Rust
Rust
103
star
51

vscode-weaudit

Create code bookmarks and code highlights with a click.
TypeScript
103
star
52

test-fuzz

To make fuzzing Rust easy
Rust
100
star
53

on-edge

A library for detecting certain improper uses of the "Defer, Panic, and Recover" pattern in Go programs
Go
97
star
54

ios-integrity-validator

Integrity validator for iOS devices
Shell
97
star
55

abi3audit

Scans Python packages for abi3 violations and inconsistencies
Python
97
star
56

ebpfault

A BPF-based syscall fault injector
C++
94
star
57

clang-cfi-showcase

Sample programs that illustrate how to use control flow integrity with the clang compiler
C++
92
star
58

awesome-ml-security

85
star
59

blight

A framework for instrumenting build tools
Python
83
star
60

ruzzy

A coverage-guided fuzzer for pure Ruby code and Ruby C extensions
Ruby
74
star
61

ManticoreUI

The Manticore User Interface with plugins for Binary Ninja and Ghidra
Python
73
star
62

bisc

Borrowed Instructions Synthetic Computation
Ruby
70
star
63

manticore-examples

Example Manticore scripts
Python
69
star
64

algo-ng

Experimental version of Algo built on Terraform
HCL
68
star
65

differ

Detecting Inconsistencies in Feature or Function Evaluations of Requirements
Python
67
star
66

deceptiveidn

Use computer vision to determine if an IDN can be interpreted as something it's not
Python
63
star
67

LeftoverLocalsRelease

The public release of LeftoverLocals code
C++
60
star
68

necessist

A tool for finding bugs in tests
Rust
59
star
69

reverie

An efficient and generalized implementation of the IKOS-style KKW proof system (https://eprint.iacr.org/2018/475) for arbitrary rings.
Rust
59
star
70

Codex-Decompiler

Python
57
star
71

testing-handbook

Trail of Bits Testing Handbook
C++
57
star
72

magnifier

C++
56
star
73

sixtyfour

How fast can we brute force a 64-bit comparison?
C
52
star
74

DomTreSat

Dominator Tree LLVM Pass to Test Satisfiability
C++
47
star
75

HVCI-loldrivers-check

PowerShell
45
star
76

nyc-infosec

Mapping the NYC Infosec Community
CSS
43
star
77

cfg-showcase

Sample programs that illustrate how to use Control Flow Guard, VS2015's control flow integrity implementation
C++
40
star
78

tsc_freq_khz

Linux kernel driver to export the TSC frequency via sysfs
C
40
star
79

rubysec

RubySec Field Guide
Ruby
40
star
80

macroni

C and C++ compiler frontend using PASTA to parse code, and VAST to represent the code as MLIR.
C
39
star
81

indurative

Easily create authenticated data structures
Haskell
37
star
82

http-security

Parse HTTP Security Headers
Ruby
36
star
83

trailofphish

Phishing e-mail repository
Ruby
36
star
84

KRFAnalysis

Collection of LLVM passes and triage tools for use with the KRF fuzzer
LLVM
35
star
85

ebpf-verifier

Harness for the Linux kernel eBPF verifier
C
32
star
86

ml-file-formats

List of ML file formats
31
star
87

umberto

poststructural fuzzing
Haskell
30
star
88

spf-query

Ruby SPF Parser
Ruby
29
star
89

ebpf-common

Various utilities useful for developers writing BPF tools
C++
29
star
90

clang-tidy-audit

Rewrite C/C++/Obj-C to Annotate Points of Interest
C++
27
star
91

eatmynetwork

A small script for running programs with (minimal) network sandboxing
Shell
26
star
92

btfparse

A C++ library that parses debug information encoded in BTF format
C++
25
star
93

anselm

Detect patterns of bad behavior in function calls
C++
25
star
94

dmarc

Ruby DMARC Parser
Ruby
25
star
95

linuxevents

A sample PoC for container-aware exec events for osquery
C++
23
star
96

mpc-learning

Perform multi-party computation on machine learning applications
Python
21
star
97

WinDbg-JS

JavaScript
21
star
98

go-mutexasserts

A small library that allows to check if Go mutexes are locked
Go
21
star
99

screen

Measure branching along code paths
C
20
star
100

itergator

CodeQL library and queries for iterator invalidation
CodeQL
19
star