Detect file content types with deep learning

Magika

Magika is a novel AI-powered file type detection tool that relies on recent advances in deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized Keras model that weighs only about 1MB and enables precise file identification within milliseconds, even when running on a single CPU.

In an evaluation with over 1M files and over 100 content types (covering both binary and textual file formats), Magika achieves 99%+ precision and recall. Magika is used at scale to help improve Google users' safety by routing Gmail, Drive, and Safe Browsing files to the proper security and content policy scanners.

You can try Magika without installing anything by using our web demo, which runs locally in your browser!

An example of what Magika's command line output looks like is given in the Usage section below.

For more context, you can read our initial announcement post on Google's OSS blog.

Highlights

  • Available as a Python command line, a Python API, and an experimental TFJS version (which powers our web demo).
  • Trained on a dataset of over 25M files across more than 100 content types.
  • In our evaluation, Magika achieves 99%+ average precision and recall, outperforming existing approaches.
  • More than 100 content types (see full list).
  • After the model is loaded (this is a one-off overhead), the inference time is about 5ms per file.
  • Batching: You can pass multiple files at once to the command line and the API, and Magika will use batching to speed up inference. You can invoke Magika with even thousands of files at the same time. You can also use -r to scan a directory recursively.
  • Near-constant inference time regardless of file size; Magika only uses a limited subset of the file's bytes.
  • Magika uses a per-content-type threshold system that determines whether to "trust" the model's prediction or to return a generic label instead, such as "Generic text document" or "Unknown binary data".
  • Supports three prediction modes, which tweak the tolerance to errors: high-confidence, medium-confidence, and best-guess.
  • It's open source! (And more is yet to come.)
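
The threshold system and prediction modes described above could be sketched roughly as follows. This is a minimal illustration, not Magika's implementation: the threshold values, the `TEXT_TYPES` set, the `Prediction` class, the `resolve` function, and the exact behaviour of each mode are all assumptions made for the example.

```python
from dataclasses import dataclass

# All values below are illustrative assumptions, not Magika's real configuration.
THRESHOLDS = {"python": 0.90, "markdown": 0.75}  # hypothetical per-content-type thresholds
MEDIUM_CONFIDENCE = 0.50                         # hypothetical fixed bar for medium-confidence
TEXT_TYPES = {"python", "markdown"}              # hypothetical set of textual content types


@dataclass
class Prediction:
    label: str
    score: float


def resolve(pred: Prediction, mode: str = "high-confidence") -> str:
    """Return the model's label if it is trusted, else a generic fallback label."""
    if mode == "best-guess":
        # Assumed behaviour: always trust the model's top prediction.
        return pred.label
    if mode == "medium-confidence":
        threshold = MEDIUM_CONFIDENCE
    elif mode == "high-confidence":
        # Fall back to the fixed bar when a type has no dedicated threshold.
        threshold = THRESHOLDS.get(pred.label, MEDIUM_CONFIDENCE)
    else:
        raise ValueError(f"unknown prediction mode: {mode}")
    if pred.score >= threshold:
        return pred.label
    return "Generic text document" if pred.label in TEXT_TYPES else "Unknown binary data"


print(resolve(Prediction("python", 0.80)))                       # Generic text document
print(resolve(Prediction("python", 0.80), "medium-confidence"))  # python
print(resolve(Prediction("python", 0.30), "best-guess"))         # python
```

The point of the sketch is the trade-off: a stricter mode returns generic labels more often, while a looser mode accepts more of the model's guesses at the cost of more errors.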

For more details, see the documentation for the python package and for the js package (dev docs).
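
To illustrate why inference time can stay nearly constant regardless of file size, the sketch below reads only a fixed-size window from the start and end of a file. It is an assumption-laden illustration of the general idea: the `sample_bytes` helper and the 512-byte window are hypothetical and do not describe Magika's actual feature extraction.

```python
import os
import tempfile
from typing import Tuple

BLOCK = 512  # hypothetical window size, not taken from Magika


def sample_bytes(path: str, block: int = BLOCK) -> Tuple[bytes, bytes]:
    """Read at most `block` bytes from the start and from the end of a file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(block)
        # Jump straight to the last `block` bytes without reading the middle.
        f.seek(max(size - block, 0))
        tail = f.read(block)
    return head, tail


# Demo: a 1 MiB file costs the same two small reads as a tiny one.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * (1 << 20))
head, tail = sample_bytes(tmp.name)
print(len(head), len(tail))  # 512 512
os.unlink(tmp.name)
```

Because the amount of data read is bounded by the window size, the work per file does not grow with the file.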

Table of Contents

  1. Getting Started
    1. Installation
    2. Running in Docker
    3. Usage
      1. Python command line
      2. Python API
      3. Experimental TFJS model & npm package
  2. Development Setup
  3. Important Documentation
  4. Known Limitations & Contributing
  5. Frequently Asked Questions
  6. Additional Resources
  7. Citation
  8. License
  9. Disclaimer

Getting Started

Installation

Magika is available as magika on PyPI:

$ pip install magika

Running in Docker

git clone https://github.com/google/magika
cd magika/
docker build -t magika .
docker run -it --rm -v $(pwd):/magika magika -r /magika/tests_data

Usage

Python command line

Examples:

$ magika -r tests_data/
tests_data/README.md: Markdown document (text)
tests_data/basic/code.asm: Assembly (code)
tests_data/basic/code.c: C source (code)
tests_data/basic/code.css: CSS source (code)
tests_data/basic/code.js: JavaScript source (code)
tests_data/basic/code.py: Python source (code)
tests_data/basic/code.rs: Rust source (code)
...
tests_data/mitra/7-zip.7z: 7-zip archive data (archive)
tests_data/mitra/bmp.bmp: BMP image data (image)
tests_data/mitra/bzip2.bz2: bzip2 compressed data (archive)
tests_data/mitra/cab.cab: Microsoft Cabinet archive data (archive)
tests_data/mitra/elf.elf: ELF executable (executable)
tests_data/mitra/flac.flac: FLAC audio bitstream data (audio)
...
$ magika code.py --json
[
    {
        "path": "code.py",
        "dl": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        },
        "output": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        }
    }
]
$ cat doc.ini | magika -
-: INI configuration file (text)
$ magika -h
Usage: magika [OPTIONS] [FILE]...

  Magika - Determine type of FILEs with deep-learning.

Options:
  -r, --recursive                 When passing this option, magika scans every
                                  file within directories, instead of
                                  outputting "directory"
  --json                          Output in JSON format.
  --jsonl                         Output in JSONL format.
  -i, --mime-type                 Output the MIME type instead of a verbose
                                  content type description.
  -l, --label                     Output a simple label instead of a verbose
                                  content type description. Use --list-output-
                                  content-types for the list of supported
                                  output.
  -c, --compatibility-mode        Compatibility mode: output is as close as
                                  possible to `file` and colors are disabled.
  -s, --output-score              Output the prediction score in addition to
                                  the content type.
  -m, --prediction-mode [best-guess|medium-confidence|high-confidence]
  --batch-size INTEGER            How many files to process in one batch.
  --no-dereference                This option causes symlinks not to be
                                  followed. By default, symlinks are
                                  dereferenced.
  --colors / --no-colors          Enable/disable use of colors.
  -v, --verbose                   Enable more verbose output.
  -vv, --debug                    Enable debug logging.
  --generate-report               Generate report useful when reporting
                                  feedback.
  --version                       Print the version and exit.
  --list-output-content-types     Show a list of supported content types.
  --model-dir DIRECTORY           Use a custom model.
  -h, --help                      Show this message and exit.

  Magika version: "0.5.0"

  Default model: "standard_v1"

  Send any feedback to [email protected] or via GitHub issues.

See the Python documentation for details.
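
The --json output is convenient to consume from scripts. The sketch below parses an abridged copy of the `magika code.py --json` output shown above, using only the standard library. The `summarize` helper is a hypothetical name introduced for this example, and the field names are the ones that appear in that sample output.

```python
import json

# Abridged copy of the `magika code.py --json` output shown above.
raw = """
[
    {
        "path": "code.py",
        "dl": {"ct_label": "python", "score": 0.9940916895866394},
        "output": {"ct_label": "python", "score": 0.9940916895866394,
                   "group": "code", "mime_type": "text/x-python"}
    }
]
"""


def summarize(json_text: str) -> list:
    """Return (path, label, MIME type, rounded score) for each entry."""
    rows = []
    for entry in json.loads(json_text):
        out = entry["output"]  # the final result after thresholding
        rows.append((entry["path"], out["ct_label"], out["mime_type"], round(out["score"], 3)))
    return rows


for row in summarize(raw):
    print(row)  # ('code.py', 'python', 'text/x-python', 0.994)
```

In a real script, `raw` would come from the captured standard output of the command rather than from a string literal.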

Python API

Examples:

>>> from magika import Magika
>>> m = Magika()
>>> res = m.identify_bytes(b"# Example\nThis is an example of markdown!")
>>> print(res.output.ct_label)
markdown

See the Python documentation for details.
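
As a small usage sketch, the snippet below tallies content-type labels over several byte strings. To keep it runnable without the model, the identification step is passed in as a callable. The `count_labels` and `toy_identify` names are hypothetical, and `toy_identify` is a trivial stand-in, not Magika's logic. The comment shows how the `identify_bytes` call from the example above could be plugged in instead.

```python
from collections import Counter
from typing import Callable, Iterable


def count_labels(identify: Callable[[bytes], str], blobs: Iterable[bytes]) -> Counter:
    """Tally the label that `identify` returns for each byte string."""
    return Counter(identify(blob) for blob in blobs)


# With Magika installed, `identify` could wrap the call shown above:
#     from magika import Magika
#     m = Magika()
#     identify = lambda data: m.identify_bytes(data).output.ct_label
# The stand-in below is a toy heuristic so that the sketch runs without the model.
def toy_identify(data: bytes) -> str:
    return "markdown" if data.startswith(b"#") else "unknown"


counts = count_labels(toy_identify, [b"# Title\n", b"# Notes\n", b"\x7fELF"])
print(counts)  # Counter({'markdown': 2, 'unknown': 1})
```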

Experimental TFJS model & npm package

We also provide Magika as an experimental package for people interested in using it in a web app. Note that the Magika JS implementation is significantly slower, and you should expect to spend 100ms+ per file.

See the JS documentation for details.

Development Setup

We use poetry for development and packaging:

$ git clone https://github.com/google/magika
$ cd magika/python
$ poetry install
$ poetry shell
$ magika -r ../tests_data

To run the tests:

$ cd magika/python
$ poetry shell
$ pytest tests/

Important Documentation

Known Limitations & Contributing

Magika significantly improves over the state of the art, but there's always room for improvement! More work can be done to increase detection accuracy, add support for additional content types, provide bindings for more languages, etc.

This initial release does not target polyglot detection, and we're looking forward to seeing adversarial examples from the community. We would also love to hear from the community about problems encountered, misdetections, feature requests, the need for support for additional content types, etc.

Check our open GitHub issues to see what is on our roadmap and please report misdetections or feature requests by either opening GitHub issues (preferred) or by emailing us at [email protected].

When reporting misdetections, you may want to use $ magika --generate-report <path> to generate a report with debug information, which you can include in your GitHub issue.

NOTE: Do NOT send reports about files that may contain PII; the report contains a small part of the file content!

See CONTRIBUTING.md for details.

Frequently Asked Questions

We have collected a number of FAQs here.

Additional Resources

Citation

If you use this software for your research, please cite it as:

@software{magika,
author = {Fratantonio, Yanick and Invernizzi, Luca and Zhang, Marina and Metitieri, Giancarlo and Kurt, Thomas and Galilee, Francois and Petit-Bianco, Alexandre and Farah, Loua and Albertini, Ange and Bursztein, Elie},
title = {{Magika content-type scanner}},
url = {https://github.com/google/magika}
}

Security vulnerabilities

Please contact us directly at [email protected]

License

Apache 2.0; see LICENSE for details.

Disclaimer

This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.
