  • Stars: 146
  • Rank: 252,769 (Top 5%)
  • Language: Shell
  • License: Other
  • Created: over 13 years ago
  • Updated: over 3 years ago

Repository Details

FemtoZip

FemtoZip is a "shared dictionary" compression library optimized for small documents that may not compress well with traditional tools such as gzip. It targets situations where a very large number of small documents (tens to thousands of bytes each) share similar characteristics but do not compress effectively on their own. Since FemtoZip's creation, Facebook has open sourced Zstd, which supports dictionary compression and is discussed more here.
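To make the "shared dictionary" idea concrete, here is a minimal sketch that uses the preset-dictionary feature of Python's standard `zlib` module. It is not FemtoZip's API or algorithm; it only illustrates the general principle that a dictionary built from text common to many documents lets each small document be compressed against that shared context. The sample dictionary and document are hypothetical.

```python
import zlib


def compress_with_dictionary(doc: bytes, dictionary: bytes) -> bytes:
    """Compress one small document against a shared preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(doc) + comp.flush()


def decompress_with_dictionary(blob: bytes, dictionary: bytes) -> bytes:
    """Decompress; the exact same dictionary must be supplied again."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()


# Hypothetical dictionary: the keys and punctuation that every document shares.
dictionary = (
    b'{"user_id": , "first_name": "", "last_name": "", '
    b'"email": "@example.com"}'
)

# Hypothetical small document with little internal repetition.
doc = (
    b'{"user_id": 42, "first_name": "Ada", "last_name": "Lovelace", '
    b'"email": "ada@example.com"}'
)

plain = zlib.compress(doc, 9)                       # no shared context
shared = compress_with_dictionary(doc, dictionary)  # shared dictionary

# The round trip only works when both sides hold the same dictionary.
assert decompress_with_dictionary(shared, dictionary) == doc

print("original bytes:       ", len(doc))
print("zlib alone:           ", len(plain))
print("zlib with dictionary: ", len(shared))
```

The dictionary is stored once and shared by the compressor and decompressor, so its size is not paid per document. That is why the approach suits large collections of small, similar records.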

How can I tell if my data will work with FemtoZip?

  1. If gzipping 1000 of your documents concatenated together in a single file achieves much better compression than gzipping the documents individually, then your data is likely tailor made for FemtoZip.
  2. Get your documents onto the file system as discrete files, and run a test using the fzip command line tool as shown in the Tutorial.
  3. If you have a Lucene search index and you want to see how much FemtoZip can compress your stored fields, try the IndexAnalyzer.
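The first check can be sketched in a few lines of Python using the standard `gzip` module. This is a rough screening aid under stated assumptions, not part of FemtoZip: the function names and the generated sample documents are hypothetical, and in practice you would load your own documents instead.

```python
import gzip
from typing import Sequence


def gzip_size(data: bytes) -> int:
    """Size in bytes of the gzip-compressed form of data."""
    return len(gzip.compress(data))


def concatenation_gain(documents: Sequence[bytes]) -> float:
    """Summed per-document gzip sizes divided by the gzip size of all
    documents concatenated. Values well above 1 suggest that the documents
    share structure that per-document gzip cannot exploit."""
    individual = sum(gzip_size(doc) for doc in documents)
    together = gzip_size(b"".join(documents))
    return individual / together


# Hypothetical sample: 1000 small records with the same keys.
documents = [
    f'{{"user_id": {i}, "status": "active", "plan": "basic"}}'.encode()
    for i in range(1000)
]

print(f"gain from concatenation: {concatenation_gain(documents):.1f}x")
```

A gain close to 1 means the documents have little in common, and a shared dictionary is unlikely to help much.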

Examples where FemtoZip is likely to outperform gzip:

  1. Small objects serialized and stored in a database or in an in-memory DHT such as memcached using PHP, JSON, or XML serialization formats. Keys and tags are repeated across documents, but may not be repeated within a document. For example, in one large scale consumer website, memcached user objects (via PHP serialization) were compressed to 29% of their gzipped size (8.3% of their original size). Doblander et al. used FemtoZip to prototype such a system.
  2. URLs, for example stored in a Lucene search index. URLs often start with "http://www.", and have common substrings like ".com/", ".html", "?page=". Again, this structure is repeated across documents, but not within a document. For example, in a large scale search engine, URLs in Lucene were compressed to 60% of their gzipped size (20% of their original size).
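The two pairs of percentages above also imply how well gzip alone did in each case, since (FemtoZip size / original size) = (FemtoZip size / gzip size) × (gzip size / original size). The short sketch below performs that division; the function name is hypothetical and the inputs are only the figures quoted in the examples.

```python
def implied_gzip_ratio(femtozip_vs_original: float,
                       femtozip_vs_gzip: float) -> float:
    """Fraction of the original size that gzip alone achieved,
    derived from the two quoted FemtoZip ratios."""
    return femtozip_vs_original / femtozip_vs_gzip


# memcached user objects: 8.3% of original size, 29% of gzipped size
memcached = implied_gzip_ratio(0.083, 0.29)

# URLs in Lucene: 20% of original size, 60% of gzipped size
urls = implied_gzip_ratio(0.20, 0.60)

print(f"gzip alone, memcached objects: {memcached:.1%} of original")
print(f"gzip alone, URLs:              {urls:.1%} of original")
```

In other words, gzip on its own reduced these documents to roughly 29% and 33% of their original size, and the shared dictionary accounts for the remaining reduction.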

Learn More

To learn more about how FemtoZip works, and how to build and use it, check out the FemtoZip wiki.

More Repositories

  1. dqn-atari (Python, 91 stars): A TensorFlow based implementation of the DeepMind Atari playing "Deep Q Learning" agent that works reasonably well
  2. mnist-vae (Python, 31 stars): Semi-supervised learning with mnist using variational autoencoders. An unsupervised representation is learned which allows for superior classification results with limited labels.
  3. mnist-gan (Python, 30 stars): A Generative Adversarial Network (GAN) for generating mnist digits
  4. ProductClassification (HTML, 24 stars): A playground for classifying products based on image and text features using deep learning.
  5. simple-raycasting (Java, 6 stars): A simple implementation of ray casting like those used in early "3D-ish" games like Wolfenstein 3D.
  6. mikado-universe (JavaScript, 6 stars): A javascript simulation of a "mikado universe" as described at http://bit.ly/ji8ai3. The toy universe shows how the macroscopic effect of gravity could be an "emergent" force resulting from the increase of entropy.
  7. active-learning-mnist (Jupyter Notebook, 4 stars): Simple example of using Active Learning on MNIST
  8. simple-autodiff (Python, 3 stars): A simple (but inefficient) auto diff algorithm in python using a "define by run" methodology.
  9. SimpleReinforcementLearning (Java, 2 stars): A demonstration of table based, SARSA reinforcement learning for a simple cat/mouse game
  10. simple-vanity-url-shortener (JavaScript, 2 stars): 60 lines of JavaScript, a Google Sheet, a bit of AWS, and a domain gets you your very own URL shortener!
  11. JavaLaunch (C++, 2 stars): JavaLaunch is yet another win32 java launcher designed for max simplicity. Built during creation of https://github.com/gtoubassi/FileBunker
  12. FileBunker (Java, 2 stars): FileBunker is a file backup application which uses one or more GMail accounts as a free, offsite backup repository.
  13. NeuralNet (Java, 1 star): From scratch Java implementation of the simple handwriting recognition neural net outlined in the first two chapters of neuralnetworksanddeeplearning.com