• Stars
    star
    234
  • Rank 171,630 (Top 4 %)
  • Language
    Python
  • License
    MIT License
  • Created over 4 years ago
  • Updated 7 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

๐Ÿงฎ Bayesian networks in Python

sorobn โ€” Bayesian networks in Python


DALLยทE 2023-03-17 09 21 56 - An oil painting by Matisse of a Bayesian network  Each node in the network is an abacus with red balls and a wooden frame

This is an unambitious Python library for working with Bayesian networks. For serious usage, you should probably be using a more established project, such as pomegranate, pgmpy, bnlearn (which is built on the latter), or even PyMC. There's also the well-documented bnlearn package in R. Hey, you could even go medieval and use something like Netica โ€” I'm just jesting, they actually have a nice tutorial on Bayesian networks. By the way, if you're not familiar with Bayesian networks, then I highly recommend Patrick Winston's MIT courses on probabilistic inference (part 1, part 2).

The main goal of this project is to be used for educational purposes. As such, more emphasis is put on tidyness and conciseness than on performance. I find libraries such as pomegranate are wonderful. But, they literally contain several thousand lines of non-obvious code, at the detriment of simplicity and ease of comprehension. I've also put some effort into designing a slick API that makes full use of pandas. Although performance is not the main focus of this library, it is reasonably efficient and should be able to satisfy most use cases in a timely manner.

Table of contents

Installation

You should be able to install and use this library with any Python version above 3.9:

pip install sorobn

Note that under the hood, sorobn uses vose for random sampling, which is written in Cython.

Usage

โœ๏ธ Manual structures

The central construct in sorobn is the BayesNet class. A Bayesian network's structure can be manually defined by instantiating a BayesNet. As an example, let's use Judea Pearl's famous alarm network:

>>> import sorobn as hh

>>> bn = hh.BayesNet(
...     ('Burglary', 'Alarm'),
...     ('Earthquake', 'Alarm'),
...     ('Alarm', 'John calls'),
...     ('Alarm', 'Mary calls'),
...     seed=42,
... )

You may also use the following notation, which is slightly more terse:

>>> import sorobn as hh

>>> bn = hh.BayesNet(
...     (['Burglary', 'Earthquake'], 'Alarm'),
...     ('Alarm', ['John calls', 'Mary calls']),
...     seed=42
... )

In Judea Pearl's example, the conditional probability tables are given. Therefore, we can define them manually by setting the values of the P attribute:

>>> import pandas as pd

# P(Burglary)
>>> bn.P['Burglary'] = pd.Series({False: .999, True: .001})

# P(Earthquake)
>>> bn.P['Earthquake'] = pd.Series({False: .998, True: .002})

# P(Alarm | Burglary, Earthquake)
>>> bn.P['Alarm'] = pd.Series({
...     (True, True, True): .95,
...     (True, True, False): .05,
...
...     (True, False, True): .94,
...     (True, False, False): .06,
...
...     (False, True, True): .29,
...     (False, True, False): .71,
...
...     (False, False, True): .001,
...     (False, False, False): .999
... })

# P(John calls | Alarm)
>>> bn.P['John calls'] = pd.Series({
...     (True, True): .9,
...     (True, False): .1,
...     (False, True): .05,
...     (False, False): .95
... })

# P(Mary calls | Alarm)
>>> bn.P['Mary calls'] = pd.Series({
...     (True, True): .7,
...     (True, False): .3,
...     (False, True): .01,
...     (False, False): .99
... })

The prepare method has to be called whenever the structure and/or the P are manually specified. This will do some house-keeping and make sure everything is sound. It is not compulsory but highly recommended, just like brushing your teeth.

>>> bn.prepare()

Note that you are allowed to specify variables that have no dependencies with any other variable:

>>> _ = hh.BayesNet(
...     ('Cloud', 'Rain'),
...     (['Rain', 'Cold'], 'Snow'),
...     'Wind speed'  # has no dependencies
... )

๐ŸŽฒ Random sampling

You can use a Bayesian network to generate random samples. The samples will follow the distribution induced by the network's structure and its conditional probability tables.

>>> from pprint import pprint

>>> pprint(bn.sample())
{'Alarm': False,
 'Burglary': False,
 'Earthquake': False,
 'John calls': False,
 'Mary calls': False}

>>> bn.sample(5)  # doctest: +SKIP
    Alarm  Burglary  Earthquake  John calls  Mary calls
0  False     False       False       False       False
1  False     False       False       False       False
2  False     False       False       False       False
3  False     False       False       False       False
4  False     False       False        True       False

You can also specify starting values for a subset of the variables.

>>> pprint(bn.sample(init={'Alarm': True, 'Burglary': True}))
{'Alarm': True,
 'Burglary': True,
 'Earthquake': False,
 'John calls': True,
 'Mary calls': False}

The supported inference methods are:

Note that randomness is controlled via the seed parameter, when BayesNet is initialized.

๐Ÿ”ฎ Probabilistic inference

A Bayesian network is a generative model. Therefore, it can be used for many purposes. For instance, it can answer probabilistic queries, such as:

What is the likelihood of there being a burglary if both John and Mary call?

This question can be answered by using the query method, which returns the probability distribution for the possible outcomes. Said otherwise, the query method can be used to look at a query variable's distribution conditioned on a given event. This can be denoted as P(query | event).

>>> bn.query('Burglary', event={'Mary calls': True, 'John calls': True})
Burglary
False    0.715828
True     0.284172
Name: P(Burglary), dtype: float64

We can also answer questions that involve multiple query variables, for instance:

What are the chances that John and Mary call if an earthquake happens?

>>> bn.query('John calls', 'Mary calls', event={'Earthquake': True})
John calls  Mary calls
False       False         0.675854
            True          0.027085
True        False         0.113591
            True          0.183470
Name: P(John calls, Mary calls), dtype: float64

By default, the answer is found via an exact inference procedure. For small networks this isn't very expensive to perform. However, for larger networks, you might want to prefer using approximate inference. The latter is a class of methods that randomly sample the network and return an estimate of the answer. The quality of the estimate increases with the number of iterations that are performed. For instance, you can use Gibbs sampling:

>>> bn.query(
...     'Burglary',
...     event={'Mary calls': True, 'John calls': True},
...     algorithm='gibbs',
...     n_iterations=1000
... )  # doctest: +SKIP
Burglary
False    0.706
True     0.294
Name: P(Burglary), dtype: float64

The supported inference methods are:

As with random sampling, randomness is controlled during BayesNet initialization, via the seed parameter.

โ“ Missing value imputation

A use case for probabilistic inference is to impute missing values. The impute method fills the missing values with the most likely replacements, given the present information. This is usually more accurate than simply replacing by the mean or the most common value. Additionally, such an approach can be much more efficient than model-based iterative imputation.

>>> sample = {
...     'Alarm': True,
...     'Burglary': True,
...     'Earthquake': False,
...     'John calls': None,  # missing
...     'Mary calls': None   # missing
... }

>>> sample = bn.impute(sample)
>>> pprint(sample)
{'Alarm': True,
 'Burglary': True,
 'Earthquake': False,
 'John calls': True,
 'Mary calls': True}

Note that the impute method can be seen as the equivalent of pomegranate's predict method.

๐Ÿคท Likelihood estimation

You can estimate the likelihood of an event with the predict_proba method:

>>> event = {
...     'Alarm': False,
...     'Burglary': False,
...     'Earthquake': False,
...     'John calls': False,
...     'Mary calls': False
... }

>>> bn.predict_proba(event)
0.936742...

In other words, predict_proba computes P(event), whereas the query method computes P(query | event). You may also estimate the likelihood for a partial event. The probabilities for the unobserved variables will be summed out.

>>> event = {'Alarm': True, 'Burglary': False}
>>> bn.predict_proba(event)
0.001576...

This also works for an event with a single variable:

>>> event = {'Alarm': False}
>>> bn.predict_proba(event)
0.997483...

Note that you can also pass a bunch of events to predict_proba, as so:

>>> events = pd.DataFrame([
...     {'Alarm': False, 'Burglary': False, 'Earthquake': False,
...      'John calls': False, 'Mary calls': False},
...
...     {'Alarm': False, 'Burglary': False, 'Earthquake': False,
...      'John calls': True, 'Mary calls': False},
...
...     {'Alarm': True, 'Burglary': True, 'Earthquake': True,
...      'John calls': True, 'Mary calls': True}
... ])

>>> bn.predict_proba(events)
Alarm  Burglary  Earthquake  John calls  Mary calls
False  False     False       False       False         0.936743
                             True        False         0.049302
True   True      True        True        True          0.000001
Name: P(Alarm, Burglary, Earthquake, John calls, Mary calls), dtype: float64

๐Ÿงฎ Parameter estimation

You can determine the values of the P from a dataset. This is a straightforward procedure, as it only requires performing a groupby followed by a value_counts for each CPT.

>>> samples = bn.sample(1000)
>>> bn = bn.fit(samples)

Note that in this case you do not have to call the prepare method because it is done for you implicitly.

If you want to update an already existing Bayesian networks with new observations, then you can use partial_fit:

>>> bn = bn.partial_fit(samples[:500])
>>> bn = bn.partial_fit(samples[500:])

The same result will be obtained whether you use fit once or partial_fit multiple times in succession.

๐Ÿงฑ Structure learning

๐ŸŒณ Chow-Liu trees

A Chow-Liu tree is a tree structure that represents a factorised distribution with maximal likelihood. It's essentially the best tree structure that can be found.

>>> samples = hh.examples.asia().sample(300)
>>> structure = hh.structure.chow_liu(samples)
>>> bn = hh.BayesNet(*structure)

๐Ÿ‘€ Visualization

You can use the graphviz method to obtain a graphviz.Digraph representation.

>>> bn = hh.examples.asia()
>>> dot = bn.graphviz()
>>> path = dot.render('asia', directory='figures', format='svg', cleanup=True)


Note that the graphviz library is not installed by default because it requires a platform dependent binary. Therefore, you have to install it by yourself.

๐Ÿ‘๏ธ Graphical user interface

A side-goal of this project is to provide a user interface to play around with a given user interface. Fortunately, we live in wonderful times where many powerful and opensource tools are available. At the moment, I have a preference for streamlit.

You can install the GUI dependencies by running the following command:

$ pip install git+https://github.com/MaxHalford/sorobn --install-option="--extras-require=gui"

You can then launch a demo by running the sorobn command:

$ sorobn

This will launch a streamlit interface where you can play around with the examples that sorobn provides. You can see a running instance of it in this Streamlit app.

An obvious next step would be to allow users to run this with their own Bayesian networks. Then again, using streamlit is so easy that you might as well do this yourself.

๐Ÿ”ข Support for continuous variables

Bayesian networks that handle both discrete and continuous are said to be hybrid. There are two approaches to deal with continuous variables. The first approach is to use parametric distributions within nodes that pertain to a continuous variable. This has two disadvantages. First, it is complex because there are different cases to handle: a discrete variable conditioned by a continuous one, a continuous variable conditioned by a discrete one, or combinations of the former with the latter. Secondly, such an approach requires having to pick a parametric distribution for each variable. Although there are methods to automate this choice for you, they are expensive and are far from being foolproof.

The second approach is to simply discretize the continuous variables. Although this might seem naive, it is generally a good enough approach and definitely makes things simpler implementation-wise. There are many ways to go about discretising a continuous attribute. For instance, you can apply a quantile-based discretization function. You could also round each number to its closest integer. In some cases you might be able to apply a manual rule. For instance, you can convert a numeric temperature to "cold", "mild", and "hot".

To summarize, we prefer to give the user the flexibility to discretize the variables by herself. Indeed, most of the time the best procedure depends on the problem at hand and cannot be automated adequately.

Toy networks

Several toy networks are available to fool around with in the examples submodule:

Here is some example usage:

>>> bn = hh.examples.sprinkler()

>>> bn.nodes
['Cloudy', 'Rain', 'Sprinkler', 'Wet grass']

>>> pprint(bn.parents)
{'Rain': ['Cloudy'],
 'Sprinkler': ['Cloudy'],
 'Wet grass': ['Rain', 'Sprinkler']}

>>> pprint(bn.children)
{'Cloudy': ['Rain', 'Sprinkler'],
 'Rain': ['Wet grass'],
 'Sprinkler': ['Wet grass']}

Development

# Download and navigate to the source code
git clone https://github.com/MaxHalford/sorobn
cd sorobn

# Install poetry
curl -sSL https://install.python-poetry.org | python3 -

# Install in development mode
poetry install

# Run tests
poetry shell
pytest

License

This project is free and open-source software licensed under the MIT license.

More Repositories

1

prince

๐Ÿ‘‘ Multivariate exploratory data analysis in Python โ€” PCA, CA, MCA, MFA, FAMD, GPA
Python
1,245
star
2

eaopt

๐Ÿ€ Evolutionary optimization library for Go (genetic algorithm, partical swarm optimization, differential evolution)
Go
881
star
3

xam

๐ŸŽฏ Personal data science and machine learning toolbox
Python
362
star
4

flask-boilerplate

๐Ÿš€ Fully fledged Flask boilerplate code
Python
354
star
5

chime

๐ŸŽต Python sound notifications made easy
Python
294
star
6

kaggle-recruit-restaurant

๐Ÿ† Kaggle 8th place solution
Jupyter Notebook
106
star
7

procedural-art

๐ŸŒŒ Procedural art with vanilla JavaScript
HTML
96
star
8

pytorch-resample

๐ŸŽฒ Iterable dataset resampling in PyTorch
Python
89
star
9

xgp

๐Ÿ”ฎ Symbolic regression library
Go
61
star
10

flask-sse-no-deps

An example of server-sent events in Flask without extra dependencies
Python
59
star
11

clavier

๐Ÿ”ค Measure edit distance based on keyboard layout
Python
58
star
12

halfgone

๐Ÿ”ณ Black and white digital halftoning
Go
47
star
13

taxi-demo-rp-mz-rv-rd-st

๐Ÿš• Self-contained demo using Redpanda, Materialize, River, Redis, and Streamlit to predict taxi trip durations
Python
44
star
14

pointu

โœ๏ธ Pointillisme tool based on Weighted Voronoi Stippling
Go
37
star
15

carre

๐Ÿ‘Œ Image simplifier
Go
33
star
16

kaggle-vsb-power

โšก 13th place solution
Jupyter Notebook
31
star
17

arcgonaut

๐ŸŒ€ Golang arc diagrams
Go
29
star
18

eaopt-examples

๐Ÿ€ eaopt examples
Go
28
star
19

starboost

โญ๐Ÿš€ Gradient boosting on steroids
Python
26
star
20

naked

The simplest way to deploy a machine learning model
Python
23
star
21

idao-2020-qualifier

Solution of team "Data O Plomo" to the qualification phase of the 2020 edition of the International Data Analysis Olympiad (IDAO)
Jupyter Notebook
18
star
22

genetic-curve-fitting

๐Ÿ“ˆ
Python
17
star
23

orc

๐ŸงŒ Parsing structured information from OCR outputs
Jupyter Notebook
17
star
24

myriade

โœจ๐ŸŒฒ Hierarchical extreme multiclass and multi-label classification.
Python
16
star
25

bike-sharing-history

๐Ÿšฒ Git scraping for bike sharing APIs
Python
16
star
26

data-science-tutorials

Jupyter Notebook
15
star
27

maxhalford.github.io

๐Ÿก Personal website
HTML
13
star
28

tuna

๐ŸŸ A streaming ETL for fish
Go
13
star
29

bbc-weather-honolulu

โ˜€๏ธ Measuring the accuracy of BBC weather forecasts in Honolulu, USA
Python
12
star
30

yamp

Yet Another MkDocs Parser
Python
11
star
31

tartine

๐Ÿž Manipulate dynamic spreadsheets with arbitrary layouts using Python
Python
11
star
32

spotgeo-challenge

๐Ÿ›ฐ๏ธ My solution to the Kelvins spotGEO challenge
Python
10
star
33

openbikes

๐Ÿšฒ Collecting and publishing bike sharing data stored at https://github.com/MaxHalford/openbikes-data
Python
9
star
34

gago

Old version of eaopt, will eventually be removed
Go
9
star
35

project-euler-python

๐Ÿ
Python
9
star
36

ikea-store-locations

๐Ÿ‡ธ๐Ÿ‡ช Retrieval and analysis of IKEA store locations
Python
9
star
37

directory-architecture

๐Ÿ“ Mimicking the tree command
Python
8
star
38

xgp-python

XGP Python package with a scikit-learn interface
Python
8
star
39

idao-2020-final

Solution of team "Data O Plomo" to the final phase of the 2020 edition of the International Data Analysis Olympiad (IDAO)
Jupyter Notebook
7
star
40

svg2stl

๐Ÿ›น Turn an SVG into an STL for stencil creation purposes
Python
6
star
41

inverted-index-search-engine

Python
5
star
42

streaming-cdf-benchmark

A benchmark to compare algorithms for estimating cumulative density functions (CDF) on streaming data
Python
5
star
43

vose

Cython implementation of Vose's Alias method
Cython
5
star
44

jan

๐Ÿ’ค Just Another Neural network
Python
5
star
45

bitcoin-analysis-m1sid

๐Ÿ’ฐ Master 1 project
Python
5
star
46

kaggle-march-madness-2019

๐Ÿ€ Men and women solutions for the 2019 edition of the Kaggle March Madness competition
Jupyter Notebook
4
star
47

kaggle-DSG18-qualifier

Jupyter Notebook
3
star
48

kaggle-DSG17-qualifier

Python
3
star
49

kaggle-plasticc-astro-classification

Jupyter Notebook
3
star
50

andor-faq-llm

๐ŸŽฒ Answering tabletop game questions using an LLM
Python
2
star
51

kaggle-answer-correctness

๐Ÿค” Solution to the Riiid! Answer Correctness Prediction competition on Kaggle
Python
2
star
52

postgres-job-docker

๐Ÿณ Docker setup for PostgreSQL + Join Order Benchmark (JOB)
Shell
2
star
53

dotfiles

๐Ÿง˜ Because it's the healthy thing to do
Shell
2
star
54

ziboinboin.com

๐Ÿ‚ Old Ziboinboin website
HTML
1
star
55

openbikes-data

๐Ÿšฒ Git storage for https://github.com/MaxHalford/openbikes
1
star
56

where-to-live

Jupyter Notebook
1
star
57

cochleas-L3SID

Python
1
star
58

chrome-infinite-scrolling-robot

JavaScript
1
star
59

kaggle-avito-demand

Python
1
star
60

tldks-2020

Jupyter Notebook
1
star