Prolly
Prolly is a Domain Specific Language (DSL) for expressing probabilities in code. Just like a database has a query language (SQL), this is a query language specifically for answering questions about probabilities of events based on the samples you've seen before.
So instead of counting all the events yourself, you just express probabilities, entropies, and information gain much like how math books express them. Being able to express probabilities directly is useful for writing machine learning algorithms at a higher level of abstraction. The right level of abstraction makes things easier to build.
We can now make decisions in code based not just on the current data, like if statements do, but on the chances derived from prior data combined with the current data, and that makes for smarter software.
What can I use this for?
There are examples of using Prolly to write learning algorithms.
Quick intro
Prolly makes it easy to express probabilities from data. It can also calculate entropies of random variables as well as the information gain.
Here's how to express Bayes' rule in Prolly:

```ruby
Ps.rv(color: :blue).given(size: :red).prob * Ps.rv(size: :red).prob /
  Ps.rv(color: :blue).prob
```

And the above will calculate P(Size = red | Color = blue).
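Since both sides are estimated from the same counts, the expression above should agree with querying the conditional probability directly:

```ruby
Ps.rv(size: :red).given(color: :blue).prob
```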
Installing
Use RubyGems to install:

```
gem install prolly
```
If you use Bundler, just add it to your `Gemfile`, and then run `bundle install`.
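The corresponding `Gemfile` entry is just:

```ruby
gem 'prolly'
```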
Usage
We first add samples of observed events, so that Prolly can estimate the probability of the events we've seen. Then we can query it with Prolly to find the probability of different events.
Adding samples
Now we add the samples of data that we've observed for the random variable. Presumably, we have a large enough dataset that we can reasonably estimate each specified RV.
```ruby
require 'prolly'
include Prolly

Ps.add({ color: :blue, size: :small })
Ps.add({ color: :blue, size: :big })
Ps.add({ color: :blue, size: :big })
Ps.add({ color: :green, size: :big })
Ps.add({ color: :green, size: :small })
```
Now that we have samples to estimate our probabilities, we're good to go on how to express them.
Note that you'll need to `include Prolly` into whatever namespace you're using it in, in order to call `Ps.add`. Otherwise, you'll need to type `Prolly::Ps.add`, if `Ps` is already taken in your namespace.
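For example, without the include:

```ruby
require 'prolly'

# Fully qualified, since Prolly was not included into this namespace
Prolly::Ps.add({ color: :blue, size: :small })
```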
Expressing Stochastics through Probability Space
`Ps` is short for Probability Space. It's normally denoted by Ω, U (for universal set), or S (for sample space) in probability textbooks. It's the set of all events that could happen.
You start with the probability space:

```ruby
Ps
```

then pick a specified random variable to examine:

```ruby
Ps.rv(color: :blue)
```
And if necessary, pick a conditional random variable:

```ruby
Ps.rv(color: :blue).given(size: :small)
```
Then pick the operation, which can be `count`, `prob`, `pdf`, `entropy`, or `infogain`.

```ruby
Ps.rv(color: :blue).given(size: :small).prob
```
And that will give you the probability that the random variable Color is :blue, given that Size was :small.
Probabilities
What is the probability there is a blue marble?
```ruby
# P(C = blue)
Ps.rv(color: :blue).prob
```
What is the joint probability there is a blue marble that also has a rough texture?
```ruby
# P(C = blue, T = rough)
Ps.rv(color: :blue, texture: :rough).prob
```
What is the probability a marble is small or med sized?
```ruby
# P(S = small or S = med)
Ps.rv(size: [:small, :med]).prob
```
What is the probability of a blue marble given that the marble is small?
```ruby
# P(C = blue | S = small)
Ps.rv(color: :blue).given(size: :small).prob
```
What is the probability of a blue marble and rough texture given that the marble is small?
```ruby
# P(C = blue, T = rough | S = small)
Ps.rv(color: :blue, texture: :rough).given(size: :small).prob
```
Probability density functions
Probability density for a random variable.
```ruby
Ps.rv(:color).pdf
```
Probability density for a conditional random variable.
```ruby
Ps.rv(:color).given(size: :small).pdf
```
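With the five samples added earlier, three of five marbles are blue, so the unconditional density would look something like this (assuming the returned hash is keyed by the variable's values):

```ruby
Ps.rv(:color).pdf # => { blue: 0.6, green: 0.4 }
```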
Entropy
Entropy of the RV color.
```ruby
# H(C)
Ps.rv(:color).entropy
```
Entropy of color given the marble is small.

```ruby
# H(C | S = small)
Ps.rv(:color).given(size: :small).entropy
```
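As a sanity check, the unconditional entropy can be computed by hand from the five samples added earlier. This is a sketch that assumes base-2 logarithms:

```ruby
# Three of five marbles are blue, two are green.
probs = [3.0 / 5, 2.0 / 5]
entropy = -probs.sum { |p| p * Math.log2(p) }
# => ~0.971, which is what Ps.rv(:color).entropy should estimate
```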
Information Gain
Information gain of color and size.
```ruby
# IG(C | S)
Ps.rv(:color).given(:size).infogain
```
Information gain of color and size, when we already know texture and opacity.
```ruby
# IG(C | S, T=smooth, O=opaque)
Ps.rv(:color).given(:size, { texture: :smooth, opacity: :opaque }).infogain
```
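Information gain is the reduction in entropy from learning the conditioning variable: IG(C | S) = H(C) - H(C | S). Here is the hand computation for the five samples added earlier, again assuming base-2 logarithms:

```ruby
h = ->(probs) { -probs.sum { |p| p * Math.log2(p) } }

h_color = h.call([3.0 / 5, 2.0 / 5])       # H(C): 3 blue, 2 green
h_color_given_size =
  2.0 / 5 * h.call([1.0 / 2, 1.0 / 2]) +   # 2 small marbles: 1 blue, 1 green
  3.0 / 5 * h.call([2.0 / 3, 1.0 / 3])     # 3 big marbles: 2 blue, 1 green

infogain = h_color - h_color_given_size    # => ~0.020
```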
Counts
At the base of all the probabilities are counts of events.
```ruby
Ps.rv(color: :blue).count
Ps.rv(:color).given(:size).count
```
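With the five samples added earlier, three of them have color :blue, so:

```ruby
Ps.rv(color: :blue).count # => 3
```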
Full Reference
A random variable can be specified, `Ps.rv(color: :blue)`, or unspecified, `Ps.rv(:color)`. So too can conditional random variables be specified or unspecified.
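For example, combining forms already shown above:

```ruby
Ps.rv(color: :blue).prob                      # specified random variable
Ps.rv(:color).pdf                             # unspecified random variable
Ps.rv(color: :blue).given(size: :small).prob  # specified condition
Ps.rv(:color).given(:size).infogain           # unspecified condition
```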
Prolly currently supports five operations.
- .prob() · Calculates probability, a fractional number representing the belief you have that an event will occur, based on the amount of evidence you've seen for that event.
- .pdf() · Calculates probability density function, a hash of all possible probabilities for the random variable.
- .entropy() · Calculates entropy, a fractional number representing the spikiness or smoothness of a density function, which implies how much information is in the random variable.
- .infogain() · Calculates information gain, a fractional number representing the amount of information (that is, reduction in uncertainty) that knowing either variable provides about the other.
- .count() · Counts the number of events satisfying the conditions.
Each of the operations will only work with certain combinations of random variables. The possibilities are listed below, and Prolly will throw an exception if an unsupported combination is used.
Legend:
- ✓ available for this operator
- Δ! planned, but not yet implemented for this operator.
The Probability Operator: .prob()
| | n/a | .given(:size) | .given(size: :small) | .given(size: :small, weight: :fat) | .given(:size, weight: :fat) | .given(:size, :weight) |
| --- | --- | --- | --- | --- | --- | --- |
| rv(color: :blue) | ✓ | ✓ | ✓ | ✓ | | |
| rv(color: [:blue, :green]) | ✓ | | | | | |
| rv(color: :blue, texture: :rough) | ✓ | ✓ | ✓ | ✓ | | |
| rv(:color) | | | | | | |
| rv(:color, :texture) | | | | | | |
The Probability Density Function Operator: .pdf()
| | n/a | .given(:size) | .given(size: :small) | .given(size: :small, weight: :fat) | .given(:size, weight: :fat) | .given(:size, :weight) |
| --- | --- | --- | --- | --- | --- | --- |
| rv(color: :blue) | | | | | | |
| rv(color: [:blue, :green]) | | | | | | |
| rv(color: :blue, texture: :rough) | | | | | | |
| rv(:color) | ✓ | ✓ | ✓ | ✓ | | |
| rv(:color, :texture) | Δ! | Δ! | Δ! | Δ! | Δ! | |
The Entropy Operator: .entropy()
| | n/a | .given(:size) | .given(size: :small) | .given(size: :small, weight: :fat) | .given(:size, weight: :fat) | .given(:size, :weight) |
| --- | --- | --- | --- | --- | --- | --- |
| rv(color: :blue) | | | | | | |
| rv(color: [:blue, :green]) | | | | | | |
| rv(color: :blue, texture: :rough) | | | | | | |
| rv(:color) | ✓ | ✓ | ✓ | ✓ | ✓ | |
| rv(:color, :texture) | ✓ | Δ! | ✓ | Δ! | ✓ | |
The Information Gain Operator: .infogain()
| | n/a | .given(:size) | .given(size: :small) | .given(size: :small, weight: :fat) | .given(:size, weight: :fat) | .given(:size, :weight) |
| --- | --- | --- | --- | --- | --- | --- |
| rv(color: :blue) | | | | | | |
| rv(color: [:blue, :green]) | | | | | | |
| rv(color: :blue, texture: :rough) | | | | | | |
| rv(:color) | ✓ | ✓ | | | | |
| rv(:color, :texture) | | | | | | |
The Count Operator: .count()
| | n/a | .given(:size) | .given(size: :small) | .given(size: :small, weight: :fat) | .given(:size, weight: :fat) | .given(:size, :weight) |
| --- | --- | --- | --- | --- | --- | --- |
| rv(color: :blue) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| rv(color: [:blue, :green]) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| rv(color: :blue, texture: :rough) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| rv(:color) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| rv(:color, :texture) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Stores
Prolly can use different stores to remember the prior event data from which it calculates probabilities. Currently Prolly implements a RubyList store and a MongoDB store.
Implementing new stores
The interface for a new store is pretty easy. It just needs to implement six methods (a minimal sketch of an in-memory store follows the method descriptions below):
`initialize`

This brings up the store, connects to it, and does whatever else you need to do at the beginning.
`reset`

This should clear the entire store of the data in the collection.
`add(datum)`

Adds one row of data to the store.
`count(rvs, options = {})`

Counts the number of samples that satisfy the RVs requested. `rvs` can be either an Array or a Hash. When it's an Array, you must count all samples that have all of the named RVs.

When it's a Hash, you must look for all samples that not only have the random variables, but also have the matching designated values. Note that a value can be an Array; when that happens, the user is indicating that any of the listed values of the RV should match.
`rand_vars`

Return a list of all random variables.

`uniq_vals(name)`

Return a list of all unique values of a random variable.
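To make the interface concrete, here is a minimal sketch of an in-memory store that implements all six methods. The class name and internals are illustrative, not part of Prolly:

```ruby
# A minimal in-memory store satisfying the six-method interface above.
class MemoryStore
  def initialize
    @data = []
  end

  def reset
    @data = []
  end

  def add(datum)
    @data << datum
  end

  # rvs is an Array of RV names (match samples that have those RVs) or a
  # Hash of name => value (match the value, or any element if it's an Array).
  def count(rvs, options = {})
    if rvs.kind_of?(Hash)
      @data.count do |datum|
        rvs.all? do |name, value|
          value.kind_of?(Array) ? value.include?(datum[name]) : datum[name] == value
        end
      end
    else
      @data.count { |datum| rvs.all? { |name| datum.key?(name) } }
    end
  end

  def rand_vars
    @data.flat_map(&:keys).uniq
  end

  def uniq_vals(name)
    @data.map { |datum| datum[name] }.compact.uniq
  end
end
```

A database-backed store, like the MongoDB one, would presumably translate `count` into a query rather than scanning every sample in memory.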
Motivation
A couple years back, I was reading a blog post by Raganwald, where I read this quote:
> A very senior Microsoft developer who moved to Google told me that Google works and thinks at a higher level of abstraction than Microsoft. "Google uses Bayesian filtering the way Microsoft uses the if statement," he said. —Joel Spolsky, Microsoft Jet
That got me thinking very literally: what would it look like if we had probability statements to use natively, the way we have if statements? How would that change how we code? It would mean we could make decisions based not just on the information we have on hand, but also on the prior information we've seen before.
Contributing
Write some specs, and make sure the entire suite passes. Then submit a pull request.
Contributors
- Wil Chung
License
MIT license
Changelog
v0.0.1
- Initial release with counts, probs, pdf, entropy, and infogain.
- Implements two stores, RubyList and MongoDB.
v0.0.2
- fix: RubyList storage had a bug that counted incorrectly.