  • Stars: 379
  • Rank: 113,004 (Top 3%)
  • Language: Python
  • License: MIT License
  • Created almost 10 years ago
  • Updated over 1 year ago

Repository Details

t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark.

tdigest

Efficient percentile estimation of streaming or distributed data

This is a Python implementation of Ted Dunning's t-digest data structure. The t-digest is designed for computing accurate estimates from streaming or distributed data. These estimates include percentiles, quantiles, and trimmed means. Two t-digests can be added together, which makes the data structure well suited to map-reduce settings, and a digest can be serialized into much less than 10kB (instead of storing the entire list of data).

See a blog post about it here: Percentile and Quantile Estimation of Big Data: The t-Digest
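
Digests can be added because each one is a collection of centroids (a mean and a count, the `m` and `c` keys that appear in the dictionary form further down), and such summaries combine without access to the raw data. The following is a toy sketch of that idea only, not the library's implementation; `ToyCentroid`, `merge_centroids`, and `summarize` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyCentroid:
    """Hypothetical stand-in for a centroid: a mean and the weight it represents."""
    mean: float
    count: float


def merge_centroids(a: ToyCentroid, b: ToyCentroid) -> ToyCentroid:
    """Combine two centroids into one whose mean is the weighted average."""
    total = a.count + b.count
    return ToyCentroid((a.mean * a.count + b.mean * b.count) / total, total)


def summarize(values):
    """Collapse a list of values into a single centroid (the coarsest possible summary)."""
    return ToyCentroid(sum(values) / len(values), len(values))


# Two "partitions" summarized independently, then merged, as in a map-reduce job.
left = summarize([1.0, 2.0, 3.0])
right = summarize([10.0])
merged = merge_centroids(left, right)
print(merged)  # ToyCentroid(mean=4.0, count=4)
```

The merged result matches what `summarize` would return for the combined data, which is the property that lets partial summaries be built on separate workers. A real t-digest keeps many centroids rather than one, so that percentiles can be estimated as well as the mean.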

Installation

tdigest is compatible with both Python 2 and Python 3.

pip install tdigest

Usage

Update the digest sequentially

from tdigest import TDigest
from numpy.random import random

digest = TDigest()
for x in range(5000):
    digest.update(random())

print(digest.percentile(15))  # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution

Update the digest in batches

another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))

Sum two digests to create a new digest

sum_digest = digest + another_digest 
sum_digest.percentile(30)  # about 0.3

Converting a digest to a dict or serializing it with JSON

You can use the to_dict() method to turn a TDigest object into a standard Python dictionary.

digest = TDigest()
digest.update(1)
digest.update(2)
digest.update(3)
print(digest.to_dict())

Or you can get only a list of Centroids with centroids_to_list().

digest.centroids_to_list()

Similarly, you can restore a Python dict of digest values with update_from_dict(). Centroids are merged with any existing ones in the digest. For example, make a fresh digest and restore values from a Python dictionary.

digest = TDigest()
digest.update_from_dict({'K': 25, 'delta': 0.01, 'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}]})

K and delta values are optional, or you can provide only a list of centroids with update_centroids_from_list().

digest = TDigest()
digest.update_centroids_from_list([{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}])

If you want to serialize with other tools like JSON, you can first convert the digest with to_dict().

import json

json.dumps(digest.to_dict())

Alternatively, make a custom encoder function to provide as default to the standard json module.

def encoder(digest_obj):
    return digest_obj.to_dict()

Then pass the encoder function as the default parameter.

json.dumps(digest, default=encoder)
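
To see what actually travels over the wire, here is a minimal round trip using only the standard json module on a dictionary in the shape shown above. It does not import tdigest; the final step of loading the result into a digest is left as a comment and assumes the update_from_dict() method described in this README.

```python
import json

# Dictionary in the shape shown earlier for to_dict() / update_from_dict().
digest_dict = {
    "K": 25,
    "delta": 0.01,
    "centroids": [
        {"c": 1.0, "m": 1.0},
        {"c": 1.0, "m": 2.0},
        {"c": 1.0, "m": 3.0},
    ],
}

payload = json.dumps(digest_dict)  # a string that can be stored or sent to another process
restored = json.loads(payload)     # back to a plain Python dict

# The round trip is lossless for this structure.
print(restored == digest_dict)

# With tdigest installed, the dict could then be loaded into a digest:
#   digest = TDigest()
#   digest.update_from_dict(restored)
```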

API

TDigest.

  • update(x, w=1): update the tdigest with value x and weight w.
  • batch_update(x, w=1): update the tdigest with values in array x and weight w.
  • compress(): compress the underlying data structure to shrink its memory footprint without hurting accuracy. Good to perform after adding many values.
  • percentile(p): return the pth percentile. Example: p=50 is the median.
  • cdf(x): return the value of the CDF at x, i.e. the estimated fraction of the data less than or equal to x.
  • trimmed_mean(p1, p2): return the mean of the data set, excluding values below the p1th percentile and above the p2th percentile.
  • to_dict(): return a Python dictionary of the TDigest and internal Centroid values.
  • update_from_dict(dict_values): update from serialized dictionary values into the TDigest object.
  • centroids_to_list(): return a Python list of the TDigest object's internal Centroid values.
  • update_centroids_from_list(list_values): update Centroids from a python list.
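
For sanity-checking the estimates, it can help to have exact counterparts of cdf and trimmed_mean computed on a small in-memory sample. The sketch below uses only the standard library; `exact_cdf` and `exact_trimmed_mean` are hypothetical helper names, and the rank-based cut-off is one of several common percentile conventions, so a digest's interpolated estimates may differ slightly.

```python
from bisect import bisect_right


def exact_cdf(sorted_data, x):
    """Fraction of observations less than or equal to x."""
    return bisect_right(sorted_data, x) / len(sorted_data)


def exact_trimmed_mean(sorted_data, p1, p2):
    """Mean of the observations between the p1-th and p2-th percentile ranks."""
    n = len(sorted_data)
    lo = int(n * p1 / 100)  # index of the first observation kept
    hi = int(n * p2 / 100)  # index one past the last observation kept
    kept = sorted_data[lo:hi]
    return sum(kept) / len(kept)


data = sorted(range(1, 101))  # the integers 1..100

print(exact_cdf(data, 15))               # 0.15
print(exact_trimmed_mean(data, 10, 90))  # 50.5
```

Feeding the same sample into a TDigest and comparing cdf(15) and trimmed_mean(10, 90) against these values gives a quick check on the approximation error.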
