Repository Details

A scalable search index for binary files

BigGrep

BigGrep is a tool for indexing and searching a large corpus of binary files. It uses a probabilistic N-gram based approach to balance index size and search speed.
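
To make the N-gram idea concrete, here is a minimal in-memory sketch. It is an illustration only, not BigGrep's on-disk index format: the function names `ngrams`, `build_index`, and `candidates` and the toy corpus are assumptions made for this example. It shows why the index yields candidates rather than exact matches: a file can contain every N-gram of the pattern without containing the pattern itself, which is why a verification step follows.

```python
from collections import defaultdict


def ngrams(data, n=3):
    """Return the set of distinct n-byte substrings of data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def build_index(files, n=3):
    """Map each n-gram to the set of file names that contain it."""
    index = defaultdict(set)
    for name, data in files.items():
        for gram in ngrams(data, n):
            index[gram].add(name)
    return index


def candidates(index, pattern, n=3):
    """Return files containing every n-gram of pattern (may be false positives)."""
    grams = ngrams(pattern, n)
    if not grams:
        raise ValueError("pattern must be at least n bytes long")
    return set.intersection(*(index.get(g, set()) for g in grams))


if __name__ == "__main__":
    # Hypothetical toy corpus; real inputs would be file contents read from disk.
    corpus = {
        "a.bin": bytes.fromhex("8bff558bec90"),
        "b.bin": bytes.fromhex("8bff5500ff558b00558bec"),  # all 3-grams, no full match
        "c.bin": bytes.fromhex("00010203"),
    }
    index = build_index(corpus)
    pattern = bytes.fromhex("8bff558bec")
    hits = candidates(index, pattern)
    verified = {name for name in hits if pattern in corpus[name]}
    print(sorted(hits))      # ['a.bin', 'b.bin']
    print(sorted(verified))  # ['a.bin']
```

Here `b.bin` is a false positive from the candidate search that the verification pass removes.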

Quickstart

BigGrep requires Boost version 1.48 or later, which should be available in most package management systems. Build and install Boost before building BigGrep. To use the Boost lockfree queue, install version 1.53 or later; this may or may not improve indexing performance.

bgsearch requires jobdispatch, a Python package that is included and installed automatically with BigGrep.

git clone https://github.com/cmu-sei/BigGrep.git  
cd BigGrep  
./autogen.sh
./configure
make
make install

Now let's make a couple of test indexes out of some Windows EXE files:

mkdir /tmp/bgi  
ls -1 /some/test/files/*.exe | bgindex -p /tmp/bgi/testidx1 -v  
ls -1 /some/more/test/files/*.exe | bgindex -p /tmp/bgi/testidx2 -v  

Now that we have executables and test indexes, here is some sample search usage with verification using bgsearch. In this case we search for a typical function entry point byte sequence, which is not overly interesting but shows how simple it is to look for the existence of an arbitrary byte sequence:

bgsearch -d /tmp/bgi/ -v 8bff558bec  
bgsearch -d /tmp/bgi/ -v program  
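
The two searches above differ only in how the search term is read: one as hex byte values, the other as an ASCII string. The following is a rough sketch of that kind of guess, assuming a simple rule (an even-length string of hex digits is treated as hex, anything else as ASCII). The function name `to_search_bytes` is hypothetical and the rule is an assumption; bgsearch's actual heuristic may differ.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def to_search_bytes(term):
    """Guess whether term is hex byte values or ASCII text and return raw bytes.

    Assumed rule for this sketch: a non-empty, even-length string made only of
    hex digits is decoded as hex; everything else is encoded as ASCII.
    """
    if term and len(term) % 2 == 0 and all(c in HEX_DIGITS for c in term):
        return bytes.fromhex(term)
    return term.encode("ascii")


if __name__ == "__main__":
    print(to_search_bytes("8bff558bec").hex())  # 8bff558bec
    print(to_search_bytes("program").hex())     # 70726f6772616d
```

A term such as `deadbeef` is ambiguous under any rule like this, which is one reason to state the search type explicitly when it matters.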

More Info

For additional details, see the original paper (also found in this repository in the doc directory):

Jin, W.; Hines, C.; Cohen, C.; Narasimhan, P., "A scalable search index for binary files," 2012 7th International Conference on Malicious and Unwanted Software (MALWARE), pp. 94-103, 16-18 Oct. 2012
doi: 10.1109/MALWARE.2012.6461014
URL: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6461014

There is also a whitepaper in the doc directory of this repo, written a little later, that describes more implementation decisions and some of our real-world usage at the time. It includes some things that changed after the MALWARE 2012 paper was written, based on what we discovered during daily usage, such as switching to 3-grams for the indexes to save disk space.

Later Changes

Further minor changes to the code and our usage have occurred since that whitepaper was written:

  • The "hints" list in the indexes was modified to get us within 16 N-grams instead of 256, further reducing I/O at the expense of a slightly larger index (this is configurable at index generation time).
  • We now use a mixed collection of 3-gram and 4-gram indexes to eliminate more I/O issues while continuing to balance disk space and speed. For a small percentage of the files (those that fill a large portion of the 3-gram space and therefore frequently appear as false positives in candidate searches against 3-gram indexes) we generate 4-gram indexes, and we use 3-grams for the rest. With the -O and -m options, bgindex can reject a file from a 3-gram index if it fills a large portion of the 3-gram space; it writes the rejected file's path to the file given to -O, and you can then generate 4-gram indexes for these denser files.
  • The search code can also filter on metadata values to further trim the candidate list before verification. The man pages give details.
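
As a rough illustration of what "fills a large portion of the 3-gram space" means, the sketch below measures the fraction of the 256^3 possible 3-grams that occur in a file and picks an N-gram size from it. The function names and the 5% threshold are assumptions for illustration only; they are not bgindex's defaults or its actual rejection logic.

```python
TRIGRAM_SPACE = 256 ** 3  # number of possible 3-byte values


def trigram_fill_ratio(data):
    """Fraction of the 3-gram space covered by the distinct 3-grams in data."""
    distinct = {data[i:i + 3] for i in range(len(data) - 2)}
    return len(distinct) / float(TRIGRAM_SPACE)


def choose_ngram_size(data, max_fill=0.05):
    """Return 4 for dense files, 3 otherwise (max_fill is a hypothetical threshold)."""
    return 4 if trigram_fill_ratio(data) > max_fill else 3


if __name__ == "__main__":
    sample = bytes(range(256))  # 254 distinct 3-grams, a very sparse file
    print(choose_ngram_size(sample))                 # 3
    print(choose_ngram_size(sample, max_fill=1e-5))  # 4
```

A dense file matches almost any 3-gram query, so indexing it with 4-grams keeps it out of most candidate lists at the cost of a larger index for that file.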

Major Components

A quick overview of the various components (see the docs and source for more info):

  • bgindex: to create an index from a list of files
  • bgsearch: Python wrapper to search the indexes for the desired byte sequence(s), optionally invoke the verification step, and filter on metadata embedded in the indexes. Unless you tell it explicitly via a command-line option, it tries to guess what you are searching for by inspecting the search strings: hex byte values or an ASCII string (which it converts to hex byte values to do the search). It can use bgverify or Yara (with a supplied rule file) to do the verification.
  • bgparse: the executable that bgsearch calls to read the indexes and build the candidate list of files containing the N-grams of the byte values.
  • bgverify: a Boyer-Moore-Horspool fast string search based verification tool to make sure the full strings of byte values exist in the candidate files found by bgparse. Good for simple verifications, but as mentioned above you can also use Yara instead (see bgsearch docs).
  • bgextractfile: removes or replaces a file from an index. This is useful if files have been purged or moved but you don't want to re-index.
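
For readers unfamiliar with Boyer-Moore-Horspool, the search technique named for bgverify above, here is a minimal sketch of the textbook algorithm. It is not bgverify's implementation, and the function name `horspool_find` is an assumption for this example. The key idea is a per-byte shift table that lets the search skip ahead based on the byte aligned with the end of the pattern.

```python
def horspool_find(haystack, needle):
    """Return the offset of the first occurrence of needle in haystack, or -1."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Shift table: for each byte value, how far the window may move when that
    # byte is aligned with the last position of the needle.
    shift = [m] * 256
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i

    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        pos += shift[haystack[pos + m - 1]]
    return -1


if __name__ == "__main__":
    data = bytes.fromhex("008bff558bec90")
    print(horspool_find(data, bytes.fromhex("8bff558bec")))  # 1
    print(horspool_find(b"abcabc", b"abd"))                  # -1
```

In a verification pass, a function like this would be run over each candidate file returned by the index search, keeping only the files where it finds a real match.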

Building

A couple of minor notes about the code & building it:

  • If installing from the tarball, you should not have to run ./autogen.sh; running ./configure, make, and make install should configure, build, and install this package.

  • The code depends on Boost and Python (2.6 or 2.7, untested with 3.x), and is known to build on Mac OS X and on various versions of the Red Hat, CentOS, and Ubuntu Linux distros.

  • By default, the biggrep and jobdispatch Python packages are installed in the site-packages directory returned by distutils.sysconfig.get_python_lib(). You can change this behavior by using --with-python-prefix{=DIR} to install the Python modules under this prefix (location will be DIR/lib/python*/biggrep). If DIR is not provided, it will use the value of PREFIX (--prefix).

  • If using a prefix when installing BigGrep, you may need to use --with-python-prefix to install the required Python modules.

  • If using --with-python-prefix, you may need to set your PYTHONPATH environment variable to the location where the biggrep Python module was installed in order to run bgsearch.

  • Boost 1.48 or later is required. On RHEL6 you may be able to find this package in the Software Collections Library; however, it may install in an alternative path. Try the following when linking against the boost148 package:

    ./autogen.sh
    LDFLAGS=-L/usr/lib64/boost148 LIBS="-lboost_system-mt -lboost_chrono-mt" ./configure BOOST_ROOT=/usr/include
    make
    make install
    

Problems?

Feel free to write an Issue for any problems you encounter.

Copyright 2011-2017 Carnegie Mellon University. See LICENSE file for terms.

DM17-0473
