ziplm

Useless but mildly interesting language model using compressors built into Python.

Usage

You can "train" it using some training data:

import ziplm

data = open(my_favorite_text_file).read().lower()
alphabet = "qwertyuiopasdfghjklzxcvbnm,.;1234567890 "
model = ziplm.ZipModel(alphabet, training=data)
"".join(model.sample_sequence(10)) # sample 10 characters from the alphabet

You can also run it without any training data, and just forward-sample to see what kinds of patterns gzip likes:

alphabet = "abc"
model = ziplm.ZipModel(alphabet)
"".join(model.sample_sequence(100)) # I get 'ccabcabcabcabcabcabcabcabcabcabcabcabcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbcbccabcbcbcbc'

You can also get the log-probability of a sequence:

alphabet = "qwertyuiopasdfghjklzxcvbnm "
model = ziplm.ZipModel(alphabet)
model.sequence_logprob("this is my favorite string") # I get -83.8

You can also try using bz2 and lzma as language models by passing them as the compressor argument to the model:

import lzma
model = ziplm.ZipModel(alphabet, compressor=lzma)
"".join(model.sample_sequence(100)) # I get 'cccbaaaaacccccabcacccbaaaaabaacaabaacaabaacaabaabacaaaaaaaaaaacccbabacaaaaaaaaaaaccccacaaccbaaaaaccc'

Why does this "work"?

This works because of two facts:

  1. A language model is nothing but a distribution over the next token given the previous tokens, $p(x \mid c)$.
  2. There is a general equivalence between probability distributions and codes.

The second point is what makes this interesting. Information theory tells us that we can derive codes from probability distributions. That is, if I have some datapoints $x$, and I know that they follow a probability distribution $p(x)$, I can come up with a lossless binary code for the $x$ in which the code for each $x$ has length $-\log_2 p(x)$. This code minimizes the average code length; the only way to do better on average is to move into the realm of lossy compression. This bound is known as the Shannon limit.
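
As a quick worked example: if $p(a) = 1/2$ and $p(b) = p(c) = 1/4$, the optimal code lengths are $1$, $2$, and $2$ bits, realized by $a \mapsto 0$, $b \mapsto 10$, $c \mapsto 11$; the average code length is $1.5$ bits, exactly the entropy of the distribution.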

Conversely, I can also convert codes to probability distributions. If I have a code (like gzip) that describes my datapoint with length $l(x)$ in binary, then that corresponds to the probability distribution $p(x) = 2^{-l(x)}$. If the code is $K$-ary, the corresponding distribution is $$p(x) = K^{-l(x)}.$$

The ZipLM model works by converting code lengths to probabilities in this way. If I have a vocabulary of size $K$, and a string $c$, then the probability distribution for continuations $x$ is: $$p(x \mid c) \propto K^{-l(cx)},$$ where the proportionality reflects the fact that we have to sum over the compressed lengths of $cx^\prime$ for all $x^\prime$ in the vocabulary. That's all there is to it.
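
Here is a minimal sketch of that computation, using only gzip and math from the standard library. The function name next_token_logprobs is made up for illustration and is not ziplm's actual API.

import gzip
import math

def next_token_logprobs(context, alphabet, compressor=gzip):
    # Compressed length l(cx) for each candidate continuation x.
    lengths = [len(compressor.compress((context + x).encode())) for x in alphabet]
    # Unnormalized log-probabilities: log p(x | c) = -l(cx) * log(K) + constant.
    K = len(alphabet)
    scores = [-length * math.log(K) for length in lengths]
    # Fix the constant by normalizing over the vocabulary (log-sum-exp).
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return {x: s - log_z for x, s in zip(alphabet, scores)}

Sampling a sequence then amounts to repeatedly drawing a token from this distribution and appending it to the context.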

How well does it work?

It's pretty bad, but it doesn't generate total junk. Here I trained the gzip model on Moby Dick (the Project Gutenberg text), and the output at least has some recognizable parts:

import ziplm

data = open("mobydick.txt").read().lower()
alphabet = "qwertyuiopasdfghjkl;'zxcvbnm,. "
model = ziplm.ZipModel(alphabet, training=data)
"".join(model.sample_sequence(100))

This gives me

"'theudcanvas. ;cm,zumhmcyoetter toauuo long a one aay,;wvbu.mvns. x the dtls and enso.;k.like bla.njv'"

which at least seems to have "long a one" in it.