  • Stars: 249
  • Rank: 162,987 (Top 4%)
  • Language: Python
  • License: Apache License 2.0
  • Created about 8 years ago
  • Updated almost 8 years ago

Gransk - Document processing for investigations

A tool for when you have a bunch of documents to figure out. Introduction to Gransk (YouTube)

Gransk is an open source tool that aims to be a Swiss army knife of document processing and analysis. Its primary objective is to quickly give users insight into their documents during investigations. It includes a processing engine written in Python and a web interface. Under the hood it uses Apache Tika for content extraction, Elasticsearch for data indexing, and dfVFS to unpack disk images.

Quickstart

Using VirtualBox:
  1. Download Gransk VM: https://drive.google.com/uc?export=download&id=0B6iPjQOwe4MKOVhma2VhWmpWaEE
  2. Open VirtualBox and click "File" -> "Import appliance". Choose the downloaded VM.
  3. Double-click the imported machine. (Hold Shift to run it in the background.)
  4. After a couple of seconds, open a web browser and go to http://localhost:8084
Using Docker on Linux/Mac:
curl -o docker-quickstart.sh -X GET https://raw.githubusercontent.com/pcbje/gransk/master/docker-quickstart.sh
sh ./docker-quickstart.sh
Using Docker on Windows:

Type the following commands into PowerShell:

Invoke-WebRequest https://raw.githubusercontent.com/pcbje/gransk/master/docker-quickstart.ps1 -Outfile docker-quickstart.ps1
powershell -ExecutionPolicy ByPass -File docker-quickstart.ps1

Features

  • Unpack disk images with dfVFS and archives with 7zip
  • Extract metadata and text from documents with Apache Tika
  • Named entity recognition with Polyglot (NER) and Namefinder
  • Entity extraction with regular expressions
  • Simple data statistics
  • Search and explore data with Elasticsearch
  • And more
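
As an illustration of the regular-expression entity extraction listed above, here is a minimal, self-contained sketch. The patterns, the `PATTERNS` table and the `extract_entities` function are hypothetical and chosen for this example only; they are not Gransk's own patterns or API.

```python
import re

# Hypothetical patterns for illustration only; Gransk's own patterns may differ.
# The IPv4 pattern is deliberately loose and does not check octet ranges.
PATTERNS = {
    'email': re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'),
    'ipv4': re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b'),
}


def extract_entities(text):
  """Return a sorted list of unique (entity_type, value) pairs found in text."""
  found = set()
  for entity_type, pattern in PATTERNS.items():
    for match in pattern.finditer(text):
      found.add((entity_type, match.group(0)))
  return sorted(found)
```

Collecting matches in a set means a value that appears several times in a document is reported once per entity type.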

Processing tested on Python 2.7 and 3.4. The web interface requires a modern web browser.

Processing overview

Development

Subscribers

Subscribers are registered in config.yml.

import gransk.core.abstract_subscriber as abstract_subscriber
import gransk.core.helper as helper


class Subscriber(abstract_subscriber.Subscriber):
  # Event types this subscriber consumes.
  CONSUMES = [helper.PROCESS_TEXT]

  def consume(self, doc, text):
    # Store the character count of the text in the document metadata.
    doc.meta['num_chars'] = len(text)
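
To make the subscriber idea more concrete, the following toy model shows how events could be routed to subscribers based on their `CONSUMES` list. It is a sketch only and does not reproduce Gransk's actual pipeline: the `Pipeline`, `Document` and `CharCounter` classes and the `PROCESS_TEXT` string are all hypothetical stand-ins so the example runs without Gransk installed.

```python
import collections

# Hypothetical event name standing in for helper.PROCESS_TEXT.
PROCESS_TEXT = 'process_text'


class Document(object):
  """Minimal stand-in for a document with a metadata dictionary."""

  def __init__(self, path):
    self.path = path
    self.meta = {}


class Pipeline(object):
  """Toy event bus that routes each event to the subscribers consuming it."""

  def __init__(self):
    self._subscribers = collections.defaultdict(list)

  def register(self, subscriber):
    # Index the subscriber under every event type it declares.
    for event in subscriber.CONSUMES:
      self._subscribers[event].append(subscriber)

  def produce(self, event, doc, payload):
    # Deliver the payload to each subscriber registered for this event.
    for subscriber in self._subscribers[event]:
      subscriber.consume(doc, payload)


class CharCounter(object):
  """Same logic as the subscriber above, without the Gransk base class."""
  CONSUMES = [PROCESS_TEXT]

  def consume(self, doc, text):
    doc.meta['num_chars'] = len(text)


pipeline = Pipeline()
pipeline.register(CharCounter())
doc = Document('example.txt')
pipeline.produce(PROCESS_TEXT, doc, 'hello')
```

After the last line runs, `doc.meta` holds the character count written by `CharCounter`. Events with no registered subscriber are simply ignored in this sketch.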

Programmatically adding files

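import io

import gransk.api as api
import gransk.core.document as document

# Create the API from the configuration file.
gransk = api.API(u'config.yml')

# Create a document for the given name and tag it.
doc = document.get_document(u'filename-or-path.txt')
doc.tag = u'demo'

# The file content is passed as an in-memory byte buffer.
content = io.BytesIO(b'Data buffer')

gransk.add_file(doc, content)

# Stop the API once all files have been added.
gransk.stop()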
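
When adding many files this way, a small helper that walks a directory and yields byte buffers can be handy. The sketch below is an assumption about how one might do that; `iter_file_buffers` is a hypothetical helper, not part of Gransk, and it reads each file fully into memory, so it is only suitable for files that fit in RAM.

```python
import io
import os


def iter_file_buffers(root):
  """Yield (path, buffer) pairs for every file found below root."""
  for dirpath, _dirnames, filenames in os.walk(root):
    for name in sorted(filenames):
      path = os.path.join(dirpath, name)
      with open(path, 'rb') as handle:
        # Copy the content into memory so the file can be closed right away.
        yield path, io.BytesIO(handle.read())


# Possible usage with the calls shown above (requires Gransk to be installed):
#
#   gransk = api.API(u'config.yml')
#   for path, content in iter_file_buffers(u'/path/to/data'):
#     doc = document.get_document(path)
#     doc.tag = u'demo'
#     gransk.add_file(doc, content)
#   gransk.stop()
```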

Conventions, code quality and documentation

Processing

autopep8 --indent-size 2 --max-line-length 80 --in-place --recursive --aggressive gransk
py.test --cov-report html --cov gransk gransk
pylint --rcfile=.pylintrc gransk

Web interface

cd gransk/web/tests && npm install && cd ../../../
gransk/web/tests/node_modules/.bin/karma start gransk/web/tests/cover.conf.js
gransk/web/tests/node_modules/.bin/karma start gransk/web/tests/watch.conf.js
jshint gransk/web/static/modules/* gransk/web/tests/spec/modules/*

Continuous integration

https://travis-ci.org/pcbje/gransk

Test build:

docker build -t gransk-prebuilt -f utils/local-travis/Dockerfile .
docker run -v $PWD:/app --entrypoint=python -it gransk-prebuilt utils/local-travis/mock-travis.py

Documentation

http://gransk.readthedocs.io

Generate docs locally:

pip install sphinx sphinx_rtd_theme
sphinx-build -c docs -b html docs/ local_data/build

Building

git clone https://github.com/pcbje/gransk && cd gransk
virtualenv pyenv
source pyenv/bin/activate
pip install -r utils/dfvfs-requirements.txt
pip install -r requirements.txt
python setup.py install
python setup.py download

Processing from command line

python -m gransk.boot.run /path/to/data
python -m gransk.boot.run --help

Using Docker:

docker run -v /path/to/data:/data --entrypoint=python -i -t pcbje/gransk -m gransk.boot.run --workers=4 /data

Starting web UI

python -m gransk.boot.web

Using Docker:

docker run -p 8084:8084 --entrypoint=python -i -t pcbje/gransk -m gransk.boot.ui --host=0.0.0.0

Searching

See es-auto-query.

Licenses

  • dfVFS: Apache License Version 2.0
  • Apache Tika: Apache License Version 2.0
  • 7zip: GNU LGPL
  • Elasticsearch: Apache License Version 2.0
  • Polyglot: GNU General Public License
  • Flask: BSD

Uh, "gransk"?

"Gransk" is the imperative form of "investigate" in Norwegian.