A scalable search index for binary files

BigGrep

BigGrep is a tool that indexes and searches a large corpus of binary files, using a probabilistic N-gram based approach to balance index size and search speed.
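
To make the N-gram idea concrete, here is a minimal in-memory sketch in Python 3 (BigGrep itself targets Python 2.6/2.7 and C++). It is not BigGrep's on-disk index format or code; the function names `ngrams`, `build_index`, `candidates`, and `verify` are hypothetical and exist only for this illustration. It shows why the index yields candidates that still need a verification pass: a file can contain every N-gram of a pattern without containing the pattern itself.

```python
from collections import defaultdict


def ngrams(data, n=3):
    """Return the set of distinct n-byte substrings of data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def build_index(files, n=3):
    """Map each n-gram to the set of file names that contain it."""
    index = defaultdict(set)
    for name, content in files.items():
        for gram in ngrams(content, n):
            index[gram].add(name)
    return index


def candidates(index, pattern, n=3):
    """Files that contain every n-gram of pattern (false positives possible)."""
    grams = ngrams(pattern, n)
    if not grams:
        raise ValueError("pattern is shorter than n bytes")
    return set.intersection(*(index.get(g, set()) for g in grams))


def verify(files, names, pattern):
    """Keep only the candidates that really contain the full pattern."""
    return {name for name in names if pattern in files[name]}


# Hypothetical in-memory "files" standing in for binaries on disk.
files = {
    "a.bin": bytes.fromhex("8bff558bec90"),
    # Contains all three 3-grams of the pattern, but never contiguously.
    "b.bin": bytes.fromhex("8bff5500ff558b00558bec"),
    "c.bin": bytes.fromhex("00010203"),
}
pattern = bytes.fromhex("8bff558bec")
index = build_index(files)
found = candidates(index, pattern)
print(sorted(found))                          # candidate list from the index
print(sorted(verify(files, found, pattern)))  # after verification
```

In this toy corpus `b.bin` survives the candidate step but is dropped by verification, which is the trade-off the probabilistic approach accepts in exchange for a smaller index.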

Quickstart

BigGrep requires Boost version 1.48 or later, which should be available in most package management systems. Build and install Boost before building BigGrep. To use the Boost lockfree queue, install version 1.53 or later; this may or may not improve performance when indexing.

bgsearch requires jobdispatch, a Python package that is included and installed automatically with BigGrep.

git clone https://github.com/cmu-sei/BigGrep.git  
cd BigGrep  
./autogen.sh
./configure
make
make install

Now let's make a couple of test indexes out of some Windows EXE files:

mkdir /tmp/bgi  
ls -1 /some/test/files/*.exe | bgindex -p /tmp/bgi/testidx1 -v  
ls -1 /some/more/test/files/*.exe | bgindex -p /tmp/bgi/testidx2 -v  

And now that we have executables and test indexes, here's some sample search usage with verification using bgsearch. The first search looks for a typical function entry point byte sequence; it is not overly interesting, but it shows how simple it is to look for the existence of an arbitrary byte sequence:

bgsearch -d /tmp/bgi/ -v 8bff558bec  
bgsearch -d /tmp/bgi/ -v program  
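
The two searches above pass a hex byte string and an ASCII string, respectively. The following Python 3 sketch shows one plausible way such a term could be normalized to raw bytes. The helper `term_to_bytes` and its regular expression are assumptions made for illustration; bgsearch's actual heuristic may differ.

```python
import re

# An even-length run of hex digits is treated as hex byte values (assumed rule).
_HEX_RE = re.compile(r"(?:[0-9a-fA-F]{2})+")


def term_to_bytes(term):
    """Guess whether term is hex or ASCII and return the bytes to search for."""
    if _HEX_RE.fullmatch(term):
        return bytes.fromhex(term)
    return term.encode("ascii")


print(term_to_bytes("8bff558bec").hex())  # parsed as hex byte values
print(term_to_bytes("program").hex())     # ASCII converted to hex byte values
```

Any guess of this kind is ambiguous for terms such as `cafe`, which is both valid hex and a plausible ASCII word, so stating the type explicitly on the command line is the safer choice when it matters.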

More Info

For additional details, see the original paper (also found in this repository in the doc directory):

Jin, W.; Hines, C.; Cohen, C.; Narasimhan, P., "A scalable search index for binary files," 2012 7th International Conference on Malicious and Unwanted Software (MALWARE), pp. 94-103, 16-18 Oct. 2012
doi: 10.1109/MALWARE.2012.6461014
URL: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6461014

There is also a whitepaper in the doc directory of this repo. It was written a little later and describes more implementation decisions and some of our real-world usage details at the time, including changes made after the MALWARE 2012 paper based on things discovered during our daily usage, such as switching to 3-grams for the indexes to save on disk space.

Later Changes

More minor changes to the code and our usage have also occurred since that whitepaper was written, such as:

  • The current "hints" list in the indexes was modified to get us within 16 N-grams instead of 256 to help further reduce I/O at the expense of slightly increasing the index size (this is configurable at index generation time)
  • We have now gone to a mixed collection of 3-gram and 4-gram indexes to help eliminate some more I/O issues while continuing to balance disk space and speed. Basically, we generate 4-gram indexes for a small percentage of the files (just the ones that fill up a large portion of the 3-gram space, making them frequently appear as false positives in the candidate searches with 3-gram indexes) and use 3-grams for the rest. With the -O and -m options, bgindex can reject a file from a 3-gram index if it fills a large portion of the 3-gram space. bgindex writes the path of each rejected file to the file given to -O, and you can then generate 4-gram indexes for those denser files.
  • The current search code can also filter on metadata values to further trim down the candidate list before verification. The man pages give details on that.
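
As a rough illustration of what "fills a large portion of the 3-gram space" means, the Python 3 sketch below measures the fraction of all possible N-grams that occur in a file and splits files into sparse and dense groups. The names `ngram_density` and `split_by_density` and the `max_density` cutoff are hypothetical; they are not the semantics of bgindex's -m or -O options, which are described in the man pages.

```python
def ngram_density(data, n=3):
    """Fraction of the 256**n possible n-grams that occur in data."""
    distinct = {data[i:i + n] for i in range(len(data) - n + 1)}
    return len(distinct) / 256 ** n


def split_by_density(files, max_density, n=3):
    """Partition file names into (sparse, dense) lists.

    max_density is an illustrative cutoff chosen for this sketch only.
    """
    sparse, dense = [], []
    for name, content in files.items():
        target = dense if ngram_density(content, n) > max_density else sparse
        target.append(name)
    return sparse, dense


# n=1 keeps the demo tiny; the case discussed above corresponds to n=3.
demo = {"text.bin": b"aaaa", "packed.bin": bytes(range(256))}
sparse, dense = split_by_density(demo, max_density=0.5, n=1)
print(sparse)  # files that would stay in the smaller-N index
print(dense)   # files to index again with a larger N
```

A dense file matches almost any 3-gram query, so moving it to a 4-gram index removes it from most candidate lists at the cost of a larger index for that one file.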

Major Components

A quick overview of the various components (see the docs and source for more info):

  • bgindex: to create an index from a list of files
  • bgsearch: Python wrapper to search the indexes for the desired byte sequence(s) and optionally invoke the verification step and filter on metadata embedded in the indexes. Unless you explicitly tell it via a command-line option, it tries to guess what you are searching for by inspecting the search strings: hex byte values or an ASCII string (which it converts to hex byte values to do the search). It can use bgverify or Yara (with a supplied rule file) to do the verification.
  • bgparse: the executable bgsearch calls that actually reads the indexes to build the candidate list of files containing the N-grams of the byte values.
  • bgverify: a Boyer-Moore-Horspool fast string search based verification tool to make sure the full strings of byte values exist in the candidate files found by bgparse. Good for simple verifications, but as mentioned above you can also use Yara instead (see bgsearch docs).
  • bgextractfile: removes a file from an index or replaces it. This is useful if files have been purged or moved but you don't want to re-index.
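
For readers unfamiliar with the Boyer-Moore-Horspool search that bgverify is based on, here is a textbook version in Python 3. It is a sketch of the algorithm only, not bgverify's C++ implementation, and the function name `horspool_find` is hypothetical.

```python
def horspool_find(haystack, needle):
    """Return the first offset of needle in haystack, or -1 if absent."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character table: how far the window may slide, keyed by the
    # haystack byte aligned with the last position of the needle.
    shift = [m] * 256
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        pos += shift[haystack[pos + m - 1]]
    return -1


data = bytes.fromhex("008bff558bec90")
print(horspool_find(data, bytes.fromhex("8bff558bec")))  # offset of the match
print(horspool_find(data, b"program"))                   # -1, not present
```

The shift table lets the search skip up to the full pattern length per step, which suits the verification pass: each candidate file is scanned once for each full byte sequence.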

Building

A couple of minor notes about the code & building it:

  • If installing from the tarball, you should not have to run ./autogen.sh. ./configure; make; make install should configure, build, and install this package.

  • The code depends on Boost and Python (2.6 or 2.7, untested with 3.x), and is known to build on various versions of RedHat, CentOS, Mac OS X, and Ubuntu Linux distros.

  • By default, the biggrep and jobdispatch Python packages are installed in the site-packages directory returned by distutils.sysconfig.get_python_lib(). You can change this behavior by using --with-python-prefix{=DIR} to install the Python modules under this prefix (location will be DIR/lib/python*/biggrep). If DIR is not provided, it will use the value of PREFIX (--prefix).

  • If using a prefix when installing BigGrep, you may need to use --with-python-prefix to install the required Python modules.

  • If using --with-python-prefix you may need to set your PYTHONPATH environment variable to the location where the biggrep python module was installed in order to run bgsearch.

  • Boost 1.48 is required. On RHEL6 you may be able to find this package in the Software Collections Library, but it may install in an alternative path. Try the following when linking against the boost148 package:

    ./autogen.sh
    LDFLAGS=-L/usr/lib64/boost148 LIBS="-lboost_system-mt -lboost_chrono-mt" ./configure BOOST_ROOT=/usr/include
    make
    make install
    

Problems?

Feel free to write an Issue for any problems you encounter.

Copyright 2011-2017 Carnegie Mellon University. See LICENSE file for terms.

DM17-0473
