PClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning

Warning: This is a rapidly evolving research prototype.

PClean was created at the MIT Probabilistic Computing Project.

If you use PClean in your research, please cite our 2021 AISTATS paper:

PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming. Lew, A. K.; Agrawal, M.; Sontag, D.; and Mansinghka, V. K. (2021, March). In International Conference on Artificial Intelligence and Statistics (pp. 1927-1935). PMLR. (pdf)

Using PClean

To use PClean, create a Julia file with the following structure:

using PClean
using DataFrames: DataFrame
import CSV

# Load data
data = CSV.File(filepath) |> DataFrame

# Define PClean model
PClean.@model MyModel begin
    @class ClassName1 begin
        ...
    end

    ...
    
    @class ClassNameN begin
        ...
    end
end

# Align column names of CSV with variables in the model.
# Format is ColumnName CleanVariable DirtyVariable, or, if
# there is no corruption for a certain variable, one can omit
# the DirtyVariable.
query = @query MyModel.ClassNameN [
  HospitalName hosp.name             observed_hosp_name
  Condition    metric.condition.desc observed_condition
  ...
]

# Configure observed dataset
observations = [ObservedDataset(query, data)]

# Configuration
config = PClean.InferenceConfig(1, 2; use_mh_instead_of_pg=true)

# SMC initialization
state = initialize_trace(observations, config)

# Rejuvenation sweeps
run_inference!(state, config)

# Evaluate accuracy, if ground truth is available
# (ground_truth_filepath is a placeholder for the path to the clean CSV)
ground_truth = CSV.File(ground_truth_filepath) |> DataFrame
results = evaluate_accuracy(data, ground_truth, state, query)

# Can print results.f1, results.precision, results.accuracy, etc.
println(results)

# Even without ground truth, can save the entire latent database to CSV files:
PClean.save_results(dir, dataset_name, state, observations)

Then run the Julia file from the root of this repository, so that JULIA_PROJECT=. selects the PClean project environment:

JULIA_PROJECT=. julia my_file.jl
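
The script above prints fields such as results.precision and results.f1. As a rough illustration of what repair metrics of this kind measure, here is a standalone sketch that does not use PClean. It assumes one convention common in the data-cleaning literature: a repair is any cell the cleaner changed, and a repair is correct when the new value matches the ground truth. The function name repair_metrics and the toy tables are hypothetical, and evaluate_accuracy may compute its results differently.

```julia
# Standalone sketch (no PClean required). Each table is a vector of rows,
# each row a vector of cell values; all three tables have the same shape.
function repair_metrics(dirty, cleaned, truth)
    repairs = 0          # cells the cleaner changed
    correct_repairs = 0  # changed cells that now match the ground truth
    errors = 0           # cells that were wrong in the dirty table
    for (d_row, c_row, t_row) in zip(dirty, cleaned, truth)
        for (d, c, t) in zip(d_row, c_row, t_row)
            # isequal treats missing values as ordinary comparable values
            if !isequal(d, t)
                errors += 1
            end
            if !isequal(c, d)
                repairs += 1
                if isequal(c, t)
                    correct_repairs += 1
                end
            end
        end
    end
    # Assumed convention: with nothing to repair, score the metric as 1.0
    prec = repairs == 0 ? 1.0 : correct_repairs / repairs
    rec = errors == 0 ? 1.0 : correct_repairs / errors
    f1 = (prec + rec) == 0 ? 0.0 : 2 * prec * rec / (prec + rec)
    return (precision = prec, recall = rec, f1 = f1)
end

# Toy data: two true errors ("Bostn", "NV"); the cleaner fixes one of them
# and wrongly changes a cell that was already correct ("TX" to "TN").
dirty   = [["Bostn", "MA"],  ["Albany", "NV"], ["Austin", "TX"]]
cleaned = [["Boston", "MA"], ["Albany", "NV"], ["Austin", "TN"]]
truth   = [["Boston", "MA"], ["Albany", "NY"], ["Austin", "TX"]]

m = repair_metrics(dirty, cleaned, truth)
println(m)
```

In this toy case one of the two repairs is correct and one of the two true errors is fixed, so precision, recall, and F1 all come out to 0.5.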

To learn to write a PClean model, see our paper, but note the surface syntax changes described below.

Differences from the paper

Because PClean is implemented as a DSL embedded in Julia, its surface syntax differs in several ways from the stand-alone syntax presented in our paper:

(1) Instead of latent class C ... end, we write @class C begin ... end.

(2) Instead of subproblem begin ... end, inference hints are given using ordinary Julia begin ... end blocks.

(3) Instead of parameter x ~ d(...), we use @learned x :: D{...}. The set of distributions D for parameters is somewhat restricted.

(4) Instead of x ~ d(...) preferring E, we write x ~ d(..., E).

(5) Instead of observe x as y, ... from C, we write @query ModelName.C [x y; ...]. Clauses of the form x z y are also allowed; they tell PClean that the model variable C.z represents a clean version of x, whose observed (dirty) version is modeled as C.y. This is used when automatically reconstructing a clean, flat dataset.

The names of built-in distributions may also be different, e.g. AddTypos instead of typos, and ProportionsParameter instead of dirichlet.
