• Stars
    star
    205
  • Rank 191,264 (Top 4 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 4 years ago
  • Updated 8 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

🐍 Python library implementing sorted containers with state-of-the-art query performance and compressed memory usage

pygm

PyGM is a Python library that enables fast query operations on sorted lists of numbers (like integers and floats) with a tiny memory overhead. Internally, it features the PGM-index, a state-of-the-art learned data structure that robustly scales to billions of elements in just a few tens of megabytes.

Build status Code coverage PyPI License GitHub stars GitHub forks

Quick start

Install with pip:

pip install pygm

PyGM supports both standard and other useful list and set operations:

>>> from pygm import SortedList, SortedSet
>>> sl = SortedList([0, 1, 34, 144, 1, 55, 233, 2, 3, 21, 89, 5, 8, 13])
>>> sl
SortedList([0, 1, 1, ..., 144, 233])
>>> sl.find_gt(9)                                   # smallest element > 9
13
>>> sl.count(1)                                     # frequency of 1
2
>>> 42 in sl                                        # membership test
False
>>> list(sl.range(0, 21, inclusive=(False, True)))  # elements 0 < x <= 21
[1, 1, 2, 3, 5, 8, 13, 21]
>>> sl[2:10:3]                                      # slicing syntax support
SortedList([1, 5, 21])
>>> (sl + [-3, -2, -1]).rank(0)                     # number of elements <= 0
4
>>> ss = SortedSet([1, 2, 3, 4]) ^ {3, 4, 5}        # set symmetric difference
>>> ss.find_lt(5)
2

The full documentation is available online and in the Python interpreter via the help() built-in function.

Installation

PyGM is compatible with Python 3.3+. The easiest way to install it is through PyPI:

pip install pygm

Otherwise, you can clone the repo, build it from source and install it as follows:

if [[ "$(uname)" == "Darwin" ]]; then brew install libomp; fi
git clone https://github.com/gvinciguerra/PyGM.git
cd PyGM
git submodule update --init --recursive
pip install .

Remember to leave the source directory PyGM/ and its parent before running Python.

Performance

Here are some plots that compare the performance of PyGM with two popular libraries, sortedcontainers and blist, on synthetic data.

Query performance of Python packages implementing sorted lists

All the graphs are log-log and show, for increasing data sizes, the average time it takes to perform each operation (lower is better). In particular, the __init__ plot shows the construction time, __contains__ measures membership queries, and __getitem__ shows the cost of accessing an element given its position.

The interesting operations on sorted lists are: (i) index, which returns the position of the first occurrence of a given element; (ii) count, which returns the number of occurrences of a given element; (iii) bisect_left, which returns the insertion point for a given value in the list to maintain the sorted order (and is used to implement find_[ge|gt|le|lt]).

You can run and plot the experiments on your computer and your data with the notebook in tests/benchmark.ipynb.

License

This project is licensed under the terms of the Apache License 2.0.

If you use the library in an academic setting, please cite the following paper:

Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB, 13(8): 1162-1175, 2020.

@article{Ferragina:2020pgm,
  Author = {Paolo Ferragina and Giorgio Vinciguerra},
  Title = {The {PGM-index}: a fully-dynamic compressed learned index with provable worst-case bounds},
  Year = {2020},
  Volume = {13},
  Number = {8},
  Pages = {1162--1175},
  Doi = {10.14778/3389133.3389135},
  Url = {https://pgm.di.unipi.it},
  Issn = {2150-8097},
  Journal = {{PVLDB}}}

More Repositories

1

PGM-index

πŸ…State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes
C++
769
star
2

la_vector

πŸ”Ά Compressed bitvector/container supporting efficient random access and rank queries
C++
40
star
3

unlister

πŸ“¬ Script for Mail on macOS that automatically unsubscribes from promotional emails and newsletters
AppleScript
38
star
4

Learned-indexes-effectiveness

Code for the TCS paper "On the performance of learned data structures" and the ICML paper "Why are learned indexes so effective?"
C++
18
star
5

countmein

🐾 People counting and surveillance with IoT devices
Python
11
star
6

BlockEpsilonTree

🌳 A compressed rank/select dictionary exploiting approximate linearity and repetitiveness.
C++
11
star
7

nnweaver

🧠 + πŸ•Έ = Neural Network Weaver. A tiny library to build and train neural networks
Python
8
star
8

PrefixPGM

Proof-of-concept extension of the PGM-index to support fixed-length strings
C++
6
star
9

CSS-tree

Single-header C++11 implementation of the Cache Sensitive Search tree (CSS-tree)
C++
6
star
10

custode

πŸ‘ΌπŸ» Personal safety app for Android
Java
5
star
11

LZEpsilon

Compressed rank/select dictionary based on Lempel-Ziv and LA-vector compression
C++
5
star
12

RearCodedArray

Compressed string dictionary based on rear-coding
C++
4
star
13

AE2020-tutorial

Repository for the students of Algorithm Engineering @ UNIPI πŸ‘Ό
C++
3
star
14

AE2022-tutorial

Repository for the students of Algorithm Engineering @ UNIPI πŸ‘Ό
C++
3
star
15

divide-and-conquer

An implementation of the divide-and-conquer parallel pattern
C++
1
star
16

TDP-2017

Application of Design Patterns to Conway's Game of Life, and other Design Patterns exercises in Java.
Java
1
star