lazysorted
lazysorted is a Python extension module for sorting sequences lazily. It presents the programmer with the abstraction that they are actually working with a sorted list, when in fact the list is only physically sorted when the programmer requests elements from it, and even then it is only sorted partially, just enough to return whatever was requested.
The LazySorted object has a constructor that implements the same interface as
the builtin sorted(...)
function, and it supports most of the non-mutating
methods of a python list.
Since the LazySorted object only sorts as much as necessary, it can be faster
than using the builtin sorted(...)
for tasks that do not require the entire
data to be sorted, like:
- Computing medians
- Computing truncated means
- Quickly iterating through the first few sorted elements of a list
- Computing the deciles or quartiles of some data
How to use it
You can use LazySorted in much the same way you use the sorted(...)
function
and the python lists it produces:
from lazysorted import LazySorted
from math import floor, ceil
def median(xs):
"""An expected linear time median function"""
ls = LazySorted(xs)
n = len(ls)
if n == 0:
raise ValueError("Need a non-empty iterable")
elif n % 2 == 1:
return ls[n//2]
else:
return sum(ls[(n/2-1):(n/2+1)]) / 2.0
def top_k(xs, k, key=None, reverse=False):
"""Efficiently computes the top k elements of xs using the given key, or
the bottom k if reverse=True"""
ls = LazySorted(xs, key=key, reverse=reverse)
return ls[0:k]
def trimmed_mean(xs, alpha=0.05):
"""Computes the mean of the xs from the alpha to (1-alpha) quantiles
in expected linear time. More robust than the ordinary sample mean."""
if not 0 <= alpha < 0.5:
raise ValueError("alpha must be in [0, 0.5)")
ls = LazySorted(xs)
n = len(ls)
if n == 0:
raise ValueError("Need a non-empty iterable")
lower = int(floor(n * alpha))
upper = int(ceil(n * (1 - alpha)))
return sum(ls.between(lower, upper)) / (upper - lower)
In addition to the __len__
and __getitem__
methods demonstrated above,
LazySorted also supports the __iter__
, __contains__
, index
, and count
methods, just like a regular python list:
>>> import random
>>> from lazysorted import LazySorted
>>> xs = list(range(1000)) + 5 * [1234]
>>> random.shuffle(xs)
>>> ls = LazySorted(xs)
>>> for x in ls:
... print(x)
... if x >= 3:
... break
0
1
2
3
>>> 1235 in ls
False
>>> ls.index(821)
821
>>> ls.count(1234)
5
Although the LazySorted constructor pretends to be equivalent to the sorted
function, and the LazySorted object pretends to be equivalent to a sorted python
list, there are a few differences between them:
- LazySorted objects are immutable, while python lists are not.
- Sorting with the builtin
sorted
function is guaranteed to be stable, (ie, preserve the original order of elements that compare equal), while LazySorted sorting is not stable. - The LazySorted object has a
between(i, j)
method, which returns a list of all the items whose sorted indices are inrange(i, j)
, but not necessarily in order. This is useful, for example, for throwing away outliers when computing an alpha-trimmed mean.
When the APIs differ between python2.x and python3.x, lazysorted implements the
python3.x version. So the LazySorted constructor does not support the cmp
argument that was removed in python3.x, and the LazySorted object does not
support the __getslice__
method that was also removed in python3.x.
All of the LazySorted methods have pretty good documentation, which can be
accessed through the builtin help(...)
function.
I've tested lazysorted and found it to work for CPython versions 2.5, 2.6, 2.7, and 3.1, 3.2, and 3.3. I haven't tested 3.0.
How it works
In short, LazySorted works by using quicksort partitions lazily and keeping track of the indices used as pivots.
quicksort sorts a list by picking an element of the list to be the "pivot", and then partitioning the data into the part that's greater than or equal to the pivot and the part that's less than the pivot. These two parts are then recursively sorted with quicksort,
quickselect finds the kth smallest element of a list by picking a pivot element and partitioning the data, as in quicksort. Then the algorithm recurses into the larger or smaller part of the list, depending on whether k is larger or smaller than the index of the pivot element.
There are two key observations to make from these algorithms: First of all, if we are only interested in part of a sorted list, we only need to recurse into the part we are interested in after doing a partition. Second of all, after doing some partitions, the list is partially sorted, with the pivots all in their sorted order and the elements between two pivots guaranteed to be bigger than the pivot to their left and smaller than the pivot to their right.
So whenever some data is queried from a LazySorted object, we first look through the pivots to see which pivots delimit the data we want. Then we partition sublist(s) as necessary and recurse into the side(s) that our data is in.
There are also some implementation details that help lazysorted to run quickly: First of all, pivots elements are chosen to be the median of three randomly selected elements, which makes the partition likely to be more balanced and guarantees average case O(n log n) behavior.
Second of all, for sufficiently small lists, lazysorted uses insertion sort instead of quicksort, which is faster on small lists. Both of these tricks are well-known to speed up quicksort implementations.
Thirdly, since it's important to find the pivots that bound an index quickly, lazysorted stores the pivots in a binary search tree, so that these sorts of lookups occur in O(log n) expected time. The BST lazysorted uses is a Treap, selected for its overall expected speed, especially in insertion and deletion.
lazysorted also makes a big effort to delete irrelevant pivots from the BST; for example, if there are three pivots at indices 5, 26, and 42, and both the data (between 5 and 26) and (between 26 and 42) is sorted, then we can remove the irrelevant pivot 26, and just say that the data between indices 5 and 42 is sorted.
Installation
lazysorted requires the python headers, (Python.h). I believe they ship with OSX, but if you don't have them they can be installed on a debian-like system with
$ sudo apt-get install python-dev
Then you can install lazysorted with
$ sudo python setup.py install
Alternatively, you can install lazysorted from pypi with
$ easy_install --user lazysorted
or
$ pip install lazysorted
though you'll still need the python headers for it to build properly.
Testing
I've put in a fair bit of effort to test that lazysorted actually does what it's supposed to. You can test it yourself (after installing it) with
$ python test.py
FAQ
Doesn't numpy have a median and percentile function?
Yes, but it's implemented by sorting the entire array and then reading off the requested values, not with quickselect or another O(n) selection algorithm. And LazySorted is empirically faster, as you can see from benchmark.py
Isn't python3.4 going to have a statistics module with a median function?
Yes, and I'm really excited about it! This is PEP450. Unfortunately, the current implementation is in pure python, and computes the median by sorting the data and picking off the middle element.
Doesn't the standard library have a heapq module?
Yes, but it lacks the full generality of this module. For example, you can use it to get the k smallest elements in O(n log k) time, but not k arbitrary contiguous elements. This module represents a different paradigm: you're allowed to program as if your list was sorted, and let the data structure deal with the details.
How is lazysorted licensed?
lazysorted is BSD-licensed. So you can use it pretty much however you like! See LICENSE for details.
What should I not use lazysorted for?
- Applications requiring a stable sort; the quicksort partitions make the order of equal elements in the sorted list undefined.
- Applications requiring guaranteed fast worst-case performance. Although it's very unlikely, many operations in LazySorted run in worst case O(n^2) time.
- Applications requiring high security. The random number generator is insecure and seeded from system time, so an (ambitious) attacker could reverse engineer the random number generator and feed LazySorted pathological lists that make it run in O(n^2) time.
- Sorting entire lists: The builtin
sorted(...)
is very impressively designed and implemented. It also has the advantage of running faster than O(n log n) on lists with partial structure.
How does lazysorted work at scale?
Unfortunately, only okay. This turns out to be primarily due to the fact that CPython deals with python objects by passing around pointers to them, causing cache misses when the list and its elements no longer fit in cache. The gory details can be found in a blog post I wrote about Memory Locality and Python Objects.
However, this effect doesn't kick in until lists grow larger than about 100K values, and even past that lazysorted remains faster than complete sorting.
Contact me!
If you use this software and feel so inclined, I'd greatly appreciate hearing what you are using it for! You can hit me up on Twitter @naftaliharris, or at my email address on my contact page.