  • Stars: 262
  • Rank 152,577 (Top 4 %)
  • Language: C
  • License: BSD 2-Clause "Sim...
  • Created almost 7 years ago
  • Updated about 2 months ago

⚡ Fast 1D and 2D histogram functions in Python ⚡

About

Sometimes you just want to compute simple 1D or 2D histograms with regular bins. Fast. No nonsense. Numpy's histogram functions are versatile and can handle, for example, non-regular binning, but this versatility comes at the expense of performance.

The fast-histogram mini-package aims to provide simple and fast histogram functions for regular bins. It doesn't do anything complicated - it just implements a simple histogram algorithm in C. The aim is to have functions that are fast but also robust and reliable. The result is a 1D histogram function that is 7-15x faster than numpy.histogram, and a 2D histogram function that is 20-25x faster than numpy.histogram2d.
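To give a rough idea of why regular bins can be cheap, here is a minimal pure-Python sketch. It is not the package's C code: the function name regular_histogram1d and the half-open [xmin, xmax) convention are assumptions made for illustration. The point is that with equal-width bins the bin index follows from one subtraction and one multiplication, with no search over bin edges:

```python
def regular_histogram1d(values, bins, value_range):
    """Count values into `bins` equal-width bins spanning value_range.

    Illustrative sketch only (hypothetical helper, not the package's code).
    Assumption: values outside the half-open interval [xmin, xmax) are skipped.
    """
    xmin, xmax = value_range
    scale = bins / (xmax - xmin)  # number of bins per unit of x, computed once
    counts = [0] * bins
    for v in values:
        if xmin <= v < xmax:
            # Regular bins: the index is direct arithmetic, not a search.
            index = int((v - xmin) * scale)
            # Guard against floating-point rounding just below xmax.
            if index == bins:
                index = bins - 1
            counts[index] += 1
    return counts


print(regular_histogram1d([0.1, 0.4, 0.45, 0.9, 1.5], bins=4, value_range=(0.0, 1.0)))
# prints [1, 2, 0, 1]
```

The value 1.5 lies outside the range and is not counted. How the real functions treat values exactly on the upper edge is not covered here.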

To install:

pip install fast-histogram

or if you use conda you can instead do:

conda install -c conda-forge fast-histogram

The fast_histogram module then provides two functions, histogram1d and histogram2d:

from fast_histogram import histogram1d, histogram2d

Example

Here's an example of binning 10 million points into a regular 2D histogram:

In [1]: import numpy as np

In [2]: x = np.random.random(10_000_000)

In [3]: y = np.random.random(10_000_000)

In [4]: %timeit _ = np.histogram2d(x, y, range=[[-1, 2], [-2, 4]], bins=30)
935 ms ± 58.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: from fast_histogram import histogram2d

In [6]: %timeit _ = histogram2d(x, y, range=[[-1, 2], [-2, 4]], bins=30)
40.2 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

(Note that the 10_000_000 literal requires Python 3.6 or later; use 10000000 instead in earlier versions.)

The version here is over 20 times faster! The following plot shows the speedup as a function of array size for the bin parameters shown above, as well as results for the 1D case, also with 30 bins:

[Figure: Comparison of performance between Numpy and fast-histogram]

The speedup for the 2D case is consistently between 20x and 25x, and for the 1D case goes from 15x for small arrays to around 7x for large arrays.

Q&A

Why don't the histogram functions return the edges?

Computing and returning the edges may seem trivial but it can slow things down by a factor of a few when computing histograms of 10^5 or fewer elements, so not returning the edges is a deliberate decision related to performance. You can easily compute the edges yourself if needed though, using numpy.linspace.
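As a minimal sketch of computing the edges yourself (the values of bins, xmin and xmax below are just example choices), a histogram with a given number of regular bins has one more edge than it has bins:

```python
import numpy as np

bins = 30
xmin, xmax = -1.0, 2.0  # example range, chosen for illustration

# `bins` regular bins have `bins + 1` edges, running from xmin to xmax inclusive.
edges = np.linspace(xmin, xmax, bins + 1)

# Bin centers, which are often handy for plotting.
centers = 0.5 * (edges[:-1] + edges[1:])
```

For the 2D case the same call can be made once per axis.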

Doesn't package X already do this, but better?

This may very well be the case! If this duplicates another package, or if it is possible to use Numpy in a smarter way to get the same performance gains, please open an issue and I'll consider deprecating this package :)

One package that does include fast histogram functions (including in n-dimensions) and can compute other statistics is vaex, so take a look there if you need more advanced functionality!

Are the 2D histograms not transposed compared to what they should be?

There is technically no 'right' and 'wrong' orientation - here we adopt the convention which gives results consistent with Numpy, so:

numpy.histogram2d(x, y, range=[[xmin, xmax], [ymin, ymax]], bins=[nx, ny])

should give the same result as:

fast_histogram.histogram2d(x, y, range=[[xmin, xmax], [ymin, ymax]], bins=[nx, ny])
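Since the convention follows Numpy, it can be checked with Numpy alone. The following small sketch (the points and bin counts are made up for illustration) shows that x runs along the first axis of the result, so the array has shape (nx, ny):

```python
import numpy as np

nx, ny = 4, 2
x = np.array([0.1, 0.1, 0.8])
y = np.array([0.9, 0.9, 0.2])

# numpy.histogram2d returns the counts plus the edges along each axis.
H, xedges, yedges = np.histogram2d(x, y, range=[[0, 1], [0, 1]], bins=[nx, ny])

print(H.shape)  # (4, 2): first axis is x, second axis is y
print(H[0, 1])  # 2.0: two points fall in the first x bin and the second y bin
```

Plotting functions that treat the first axis as rows (for example image displays) will therefore usually need the transposed array, H.T.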

Why not contribute this to Numpy directly?

As mentioned above, the Numpy functions are much more versatile, so they could not be replaced by the ones here. One option would be to check in Numpy's functions for cases that are simple and dispatch to functions such as the ones here, or add dedicated functions for regular binning. I hope we can get this in Numpy in some form or another eventually, but for now, the aim is to have this available to packages that need to support a range of Numpy versions.

Why not use Cython?

I originally implemented this in Cython, but found that I could get a 50% performance improvement by going straight to a C extension.

What about using Numba?

I specifically want to keep this package as easy as possible to install, and while Numba is a great package, it is not trivial to install outside of Anaconda.

Could this be parallelized?

This may benefit from parallelization under certain circumstances. The easiest solution might be to use OpenMP, but this won't work on all platforms, so it would need to be made optional.
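Independently of OpenMP, histograms of separate chunks of the data can simply be added together, which is what makes parallelization possible in the first place. The following is only a sketch of that idea at the Python level, using numpy.histogram and a thread pool rather than anything from this package; the helper name chunked_histogram is hypothetical, and no speedup is claimed:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def chunked_histogram(x, bins, value_range, n_chunks=4):
    """Histogram each chunk separately, then sum the per-chunk counts."""
    chunks = np.array_split(x, n_chunks)

    def count(chunk):
        # [0] keeps the counts and drops the edges.
        return np.histogram(chunk, bins=bins, range=value_range)[0]

    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial_counts = list(pool.map(count, chunks))

    # Counts are additive, so the total is the element-wise sum.
    return np.sum(partial_counts, axis=0)


rng = np.random.default_rng(0)
x = rng.random(100_000)

combined = chunked_histogram(x, bins=30, value_range=(-1, 2))
single = np.histogram(x, bins=30, range=(-1, 2))[0]
```

Here combined and single hold the same counts; whether splitting the work actually helps depends on the array size and the platform.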

Couldn't you make it faster by using the GPU?

Almost certainly, though the aim here is to have an easily installable and portable package, and introducing GPUs is going to affect both of these.

Why make a package specifically for this? This is a tiny amount of functionality

Packages that need this could simply bundle their own C extension or Cython code to do this, but the main motivation for releasing this as a mini-package is to avoid making pure-Python packages into packages that require compilation just because of the need to compute fast histograms.

Can I contribute?

Yes please! This is not meant to be a finished package, and I welcome pull requests to improve things.
