  • Stars: 321
  • Rank: 130,752 (Top 3 %)
  • Language: Jupyter Notebook
  • License: Other
  • Created almost 5 years ago
  • Updated 12 months ago

Repository Details

distfit is a Python library for probability density fitting.

Blogs

1. How to Find the Best Theoretical Distribution for Your Data

2. Outlier Detection Using Distribution Fitting in Univariate Datasets

3. Step-by-Step Guide to Generate Synthetic Data by Sampling From Univariate Distributions

Documentation pages

distfit is a Python package for probability density fitting of univariate distributions for random variables. Given a random variable as input, distfit can find the best fit for parametric, non-parametric, and discrete distributions.

  • For the parametric approach, distfit can determine the best fit across 89 theoretical distributions. The fit is scored with one of the goodness-of-fit statistics, such as RSS/SSE, Wasserstein, Kolmogorov-Smirnov (KS), or Energy. Once the best-fitting theoretical distribution is found, its loc, scale, and arg parameters are returned, such as the mean and standard deviation for the normal distribution.

  • For the non-parametric approach, distfit provides two methods: the quantile method and the percentile method. Both assume that the data do not follow a specific probability distribution. The quantile method models the quantiles of the data, whereas the percentile method models the percentiles.

  • If the dataset contains discrete values, distfit offers discrete fitting. The best fit is then derived using the binomial distribution.
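
As a rough illustration of the quantile idea in the non-parametric bullet above, the sketch below derives lower and upper bounds directly from the empirical quantiles of the data, without assuming any distribution. It uses only the Python standard library and is not distfit's implementation; the helper names `empirical_quantile` and `quantile_bounds`, the `alpha` threshold, and the linear interpolation rule are assumptions made for this example.

```python
import random


def empirical_quantile(sorted_data, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    if not sorted_data:
        raise ValueError("data must not be empty")
    pos = q * (len(sorted_data) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_data) - 1)
    frac = pos - lower
    return sorted_data[lower] * (1 - frac) + sorted_data[upper] * frac


def quantile_bounds(data, alpha=0.05):
    """Return the (alpha, 1 - alpha) empirical quantiles of the data."""
    ordered = sorted(data)
    return empirical_quantile(ordered, alpha), empirical_quantile(ordered, 1 - alpha)


# Hypothetical example data: 10,000 draws from a normal distribution.
rng = random.Random(0)
X = [rng.gauss(0, 2) for _ in range(10000)]

low, high = quantile_bounds(X, alpha=0.05)

# Flag new values that fall outside the empirical bounds.
y = [-8, 0, 1, 6]
labels = ["outside" if (v < low or v > high) else "inside" for v in y]
print(f"bounds: [{low:.3f}, {high:.3f}]", labels)
```

Because the bounds come straight from the sorted data, no loc, scale, or shape parameters are estimated in this approach.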

⭐️ Star this repo if you like it ⭐️

Installation

Install distfit from PyPI
pip install distfit
Install from GitHub source (beta version)
pip install git+https://github.com/erdogant/distfit
Check version
import distfit
print(distfit.__version__)
The following functions are available after installation:
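# Import library
import numpy as np
from distfit import distfit

# Example data (assumed for illustration): empirical samples X and new values y
X = np.random.normal(0, 2, 10000)
y = [-8, -6, 0, 1, 2, 3, 4, 5, 6]

dfit = distfit()        # Initialize
dfit.fit_transform(X)   # Fit distributions on empirical data X
dfit.predict(y)         # Predict the probability of the response variables
dfit.plot()             # Plot the best fitted distribution (y is included if prediction is made)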

Examples

Example: Quick start to find best fit for your input data
# [distfit] >INFO> fit
# [distfit] >INFO> transform
# [distfit] >INFO> [norm      ] [0.00 sec] [RSS: 0.00108326] [loc=-0.048 scale=1.997]
# [distfit] >INFO> [expon     ] [0.00 sec] [RSS: 0.404237] [loc=-6.897 scale=6.849]
# [distfit] >INFO> [pareto    ] [0.00 sec] [RSS: 0.404237] [loc=-536870918.897 scale=536870912.000]
# [distfit] >INFO> [dweibull  ] [0.06 sec] [RSS: 0.0115552] [loc=-0.031 scale=1.722]
# [distfit] >INFO> [t         ] [0.59 sec] [RSS: 0.00108349] [loc=-0.048 scale=1.997]
# [distfit] >INFO> [genextreme] [0.17 sec] [RSS: 0.00300806] [loc=-0.806 scale=1.979]
# [distfit] >INFO> [gamma     ] [0.05 sec] [RSS: 0.00108459] [loc=-1862.903 scale=0.002]
# [distfit] >INFO> [lognorm   ] [0.32 sec] [RSS: 0.00121597] [loc=-110.597 scale=110.530]
# [distfit] >INFO> [beta      ] [0.10 sec] [RSS: 0.00105629] [loc=-16.364 scale=32.869]
# [distfit] >INFO> [uniform   ] [0.00 sec] [RSS: 0.287339] [loc=-6.897 scale=14.437]
# [distfit] >INFO> [loggamma  ] [0.12 sec] [RSS: 0.00109042] [loc=-370.746 scale=55.722]
# [distfit] >INFO> Compute confidence intervals [parametric]
# [distfit] >INFO> Compute significance for 9 samples.
# [distfit] >INFO> Multiple test correction method applied: [fdr_bh].
# [distfit] >INFO> Create PDF plot for the parametric method.
# [distfit] >INFO> Mark 5 significant regions
# [distfit] >INFO> Estimated distribution: beta [loc:-16.364265, scale:32.868811]
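
To make the RSS scores in the log above more concrete, here is a minimal sketch of how a residual sum of squares between a histogram of the data and a candidate density could be computed. It is a standard-library approximation and not distfit's code: the bin count, the simple parameter estimates (sample mean and standard deviation for the normal, minimum and range for the uniform, minimum and mean offset for the exponential), and the helper names `histogram_density` and `rss` are all assumptions of this example.

```python
import math
import random
import statistics


def histogram_density(data, bins=50):
    """Return bin centers and density-normalized bin heights."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in data:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    density = [c / (len(data) * width) for c in counts]
    return centers, density


def rss(centers, density, pdf):
    """Residual sum of squares between the histogram and a candidate PDF."""
    return sum((d - pdf(c)) ** 2 for c, d in zip(centers, density))


# Hypothetical example data.
rng = random.Random(42)
X = [rng.gauss(0, 2) for _ in range(10000)]
centers, density = histogram_density(X)

# Candidate 1: normal, parameters from the sample mean and standard deviation.
normal = statistics.NormalDist.from_samples(X)

# Candidate 2: uniform on [loc, loc + scale].
loc = min(X)
scale = max(X) - loc


def uniform_pdf(x):
    return 1.0 / scale if loc <= x <= loc + scale else 0.0


# Candidate 3: exponential shifted to start at the minimum.
exp_scale = statistics.fmean(X) - loc


def expon_pdf(x):
    return math.exp(-(x - loc) / exp_scale) / exp_scale if x >= loc else 0.0


scores = {
    "norm": rss(centers, density, normal.pdf),
    "uniform": rss(centers, density, uniform_pdf),
    "expon": rss(centers, density, expon_pdf),
}
best = min(scores, key=scores.get)
print(scores, "best:", best)
```

The candidate with the lowest RSS is the one whose density follows the histogram most closely, which is the ranking principle visible in the log.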

Example: Plot summary of the tested distributions

Once the model is fitted, the theoretical distribution can be used to make predictions. Plotting again after making predictions automatically includes them in the plot.

Example: Make predictions using the fitted distribution
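
The idea behind predicting with a fitted distribution can be sketched with the standard library alone. The example below assumes a normal distribution was the best fit, computes a tail probability for each new value, and applies a Benjamini-Hochberg correction (the `fdr_bh` method named in the log above). It is an illustration rather than distfit's implementation; the function `benjamini_hochberg`, the one-sided tail rule, the `alpha` cutoff, and the "up"/"down"/"none" labels are assumptions of this sketch.

```python
import random
import statistics


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest to keep the result monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example data and a normal fit standing in for the best distribution.
rng = random.Random(1)
X = [rng.gauss(0, 2) for _ in range(10000)]
fitted = statistics.NormalDist.from_samples(X)

# New, unseen values to score against the fitted distribution.
y = [-8, -6, 0, 1, 2, 3, 4, 5, 6]

# Tail probability: the smaller of the lower and upper tail for each value.
pvalues = [min(fitted.cdf(v), 1 - fitted.cdf(v)) for v in y]
adjusted = benjamini_hochberg(pvalues)

alpha = 0.05
labels = []
for value, p_adj in zip(y, adjusted):
    if p_adj > alpha:
        labels.append("none")
    elif value < fitted.mean:
        labels.append("down")
    else:
        labels.append("up")

for value, p_adj, label in zip(y, adjusted, labels):
    print(f"y={value:>3}  adjusted p={p_adj:.5f}  {label}")
```

Values far out in either tail receive small adjusted probabilities and are flagged, while values near the center of the fitted distribution are not.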

Example: Test for one specific distribution

The full list of distributions is listed here: https://erdogant.github.io/distfit/pages/html/Parametric.html

Example: Test for multiple distributions

The full list of distributions is listed here: https://erdogant.github.io/distfit/pages/html/Parametric.html

Example: Fit discrete distribution
from scipy.stats import binom
# Generate random numbers

# Set parameters for the test-case
n = 8
p = 0.5

# Generate 10000 samples of the distribution of (n, p)
X = binom(n, p).rvs(10000)
print(X)

# [5 1 4 5 5 6 2 4 6 5 4 4 4 7 3 4 4 2 3 3 4 4 5 1 3 2 7 4 5 2 3 4 3 3 2 3 5
#  4 6 7 6 2 4 3 3 5 3 5 3 4 4 4 7 5 4 5 3 4 3 3 4 3 3 6 3 3 5 4 4 2 3 2 5 7
#  5 4 8 3 4 3 5 4 3 5 5 2 5 6 7 4 5 5 5 4 4 3 4 5 6 2...]

# Import distfit
from distfit import distfit

# Initialize for discrete distribution fitting
dfit = distfit(method='discrete')

# Run distfit and check whether the parameters can be recovered from the data.
dfit.fit_transform(X)

# [distfit] >fit..
# [distfit] >transform..
# [distfit] >Fit using binomial distribution..
# [distfit] >[binomial] [SSE: 7.79] [n: 8] [p: 0.499959] [chi^2: 1.11]
# [distfit] >Compute confidence interval [discrete]
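
For intuition about how binomial parameters might be recovered from discrete data, the following standard-library sketch searches over candidate values of n, estimates p from the sample mean, and keeps the pair with the lowest sum of squared errors between observed frequencies and the binomial PMF. This is a simplified stand-in and not the procedure distfit uses internally; `fit_binomial`, its `max_extra` search range, and the frequency-based SSE are assumptions of this example, so the SSE value is not comparable to the one in the log above.

```python
import math
import random
from collections import Counter


def binom_pmf(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def fit_binomial(data, max_extra=10):
    """Grid-search n, estimate p from the mean, and score each pair by SSE."""
    total = len(data)
    counts = Counter(data)
    mean = sum(data) / total
    n_start = max(max(data), 1)  # n cannot be smaller than the largest observation
    best = None
    for n in range(n_start, n_start + max_extra + 1):
        p = mean / n
        sse = sum(
            (counts.get(k, 0) / total - binom_pmf(k, n, p)) ** 2
            for k in range(n + 1)
        )
        if best is None or sse < best[2]:
            best = (n, p, sse)
    return best


# Hypothetical data: 10,000 draws, each the number of successes in 8 fair trials.
rng = random.Random(0)
X = [sum(rng.random() < 0.5 for _ in range(8)) for _ in range(10000)]

n_hat, p_hat, sse = fit_binomial(X)
print(f"n={n_hat}  p={p_hat:.4f}  SSE={sse:.6f}")
```

Estimating p as mean / n follows from the binomial mean n * p, which leaves n as the only parameter to search over.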

Example: Make predictions on unseen data for discrete distribution

Example: Generate samples based on the fitted distribution
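
Generating synthetic data from a fitted distribution amounts to sampling from the estimated parameters. The sketch below shows the idea with the standard library's `statistics.NormalDist`, assuming a normal distribution as the fitted model; it does not use distfit, and the sample sizes and seeds are arbitrary choices for this example.

```python
import random
import statistics

# Hypothetical empirical data.
rng = random.Random(7)
X = [rng.gauss(0, 2) for _ in range(10000)]

# Fit a normal distribution by estimating its mean and standard deviation.
fitted = statistics.NormalDist.from_samples(X)

# Draw synthetic samples from the fitted distribution.
synthetic = fitted.samples(1000, seed=123)

# Refit on the synthetic data to check that it resembles the original fit.
refit = statistics.NormalDist.from_samples(synthetic)
print(f"fitted: mean={fitted.mean:.3f} stdev={fitted.stdev:.3f}")
print(f"refit:  mean={refit.mean:.3f} stdev={refit.stdev:.3f}")
```

Refitting on the generated samples is a quick sanity check that the synthetic data follows the same distribution as the original fit.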

Contributors

Setting up and maintaining distfit has been possible thanks to its users and contributors.

Citation

Please cite distfit in your publications if it is useful for your research. See the right-hand column of the repository page for citation information.

Maintainer

  • Erdogan Taskesen, github: erdogant
  • Contributions are welcome.
  • If you wish to buy me a coffee for this work, it is much appreciated :)

More Repositories

1. bnlearn (Jupyter Notebook, 410 stars): Python library for learning the graphical structure of Bayesian networks, parameter learning, inference and sampling methods.

2. pca (Jupyter Notebook, 252 stars): pca: A Python Package for Principal Component Analysis.

3. findpeaks (Python, 179 stars): The detection of peaks and valleys in a 1d-vector or 2d-array (image)

4. d3graph (Jupyter Notebook, 149 stars): Creation of interactive networks using d3 Javascript

5. clustimage (Jupyter Notebook, 74 stars): clustimage is a Python package for unsupervised clustering of images.

6. hgboost (Python, 51 stars): hgboost is a Python package for hyper-parameter optimization for xgboost, catboost or lightboost using cross-validation, and evaluating the results on an independent validation set. hgboost can be applied for classification and regression tasks.

7. clusteval (Jupyter Notebook, 46 stars): Clusteval provides methods for unsupervised cluster validation

8. benfordslaw (Python, 39 stars): benfordslaw is about the frequency distribution of leading digits.

9. undouble (Python, 38 stars): undouble is a Python package to detect (near-)identical images.

10. kaplanmeier (Python, 26 stars): kaplanmeier is a Python library to create survival curves using kaplan-meier, and compute the log-rank test.

11. googletrends (Python, 22 stars): Google trends is to examine trending google searches on geographical location and across time for input keywords.

12. hnet (Python, 21 stars): Association rule-based networks using graphical Hypergeometric Networks.

13. caerus (Python, 19 stars): Detection of favorable moments in time series data

14. treeplot (Python, 11 stars): Plot tree based machine learning models

15. d3heatmap (HTML, 9 stars): d3heatmap is a Python package to create interactive heatmaps based on d3js.

16. flameplot (Python, 8 stars): flameplot is a Python package for the quantification of local similarity across two maps or embeddings.

17. worldmap (Python, 7 stars): This Python package colors different countries in the world or the regions per country.

18. ismember (Python, 7 stars): ismember

19. scatterd (Python, 7 stars): Scatterd is a Python package for easy and fast creation of beautiful scatter plots.

20. classeval (Python, 5 stars): Evaluation of supervised predictions for two-class and multi-class classifiers

21. imagesc (Python, 4 stars): Make quick and beautiful heatmaps

22. df2onehot (Python, 3 stars): Convert an unstructured array into a structured dataframe.

23. colourmap (Python, 3 stars): Colourmap generates a unique list of RGB and HEX colors for the specified input list

24. datazets (Python, 3 stars): Datazets is a Python package to retrieve example data sets.

25. pypickle (Python, 2 stars): pypickle is for saving and loading files in pickle format.

26. irelease (Python, 2 stars): Library that automates releasing your Github python package at Pypi.

27. thompson (Python, 2 stars): Thompson is a Python package to evaluate the multi-armed bandit problem. In addition to thompson, Upper Confidence Bound (UCB) algorithm, and randomized results are also implemented.

28. dicter (Python, 2 stars): Python package with advanced dictionary functions. Traverse through nested dicts. Set and get multiple keys. Flattens dicts. Store and load in json and more!

29. relevantpackage (Python, 1 star): Example of a Python Package

30. bnclassify (Python, 1 star): bnlearn

31. d3plus (Python, 1 star): d3plus