• Stars
    star
    410
  • Rank 102,071 (Top 3 %)
  • Language
    Julia
  • License
    Other
  • Created over 9 years ago
  • Updated 5 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A Julia package for evaluating distances (metrics) between vectors.

Distances.jl

Build Status Coverage Status

A Julia package for evaluating distances (metrics) between vectors.

This package also provides optimized functions to compute column-wise and pairwise distances, which are often substantially faster than a straightforward loop implementation. (See the benchmark section below for details).

Supported distances

  • Euclidean distance
  • Squared Euclidean distance
  • Periodic Euclidean distance
  • Cityblock distance
  • Total variation distance
  • Jaccard distance
  • Rogers-Tanimoto distance
  • Chebyshev distance
  • Minkowski distance
  • Hamming distance
  • Cosine distance
  • Correlation distance
  • Chi-square distance
  • Kullback-Leibler divergence
  • Generalized Kullback-Leibler divergence
  • Rรฉnyi divergence
  • Jensen-Shannon divergence
  • Mahalanobis distance
  • Squared Mahalanobis distance
  • Bhattacharyya distance
  • Hellinger distance
  • Haversine distance
  • Spherical angle distance
  • Mean absolute deviation
  • Mean squared deviation
  • Root mean squared deviation
  • Normalized root mean squared deviation
  • Bray-Curtis dissimilarity
  • Bregman divergence

For Euclidean distance, Squared Euclidean distance, Cityblock distance, Minkowski distance, and Hamming distance, a weighted version is also provided.

Basic use

The library supports three ways of computation: computing the distance between two iterators/vectors, "zip"-wise computation, and pairwise computation. Each of these computation modes works with arbitrary iterable objects of known size.

Computing the distance between two iterators or vectors

Each distance corresponds to a distance type. You can always compute a certain distance between two iterators or vectors of equal length using the following syntax

r = evaluate(dist, x, y)
r = dist(x, y)

Here, dist is an instance of a distance type: for example, the type for Euclidean distance is Euclidean (more distance types will be introduced in the next section). You can compute the Euclidean distance between x and y as

r = evaluate(Euclidean(), x, y)
r = Euclidean()(x, y)

Common distances also come with convenient functions for distance evaluation. For example, you may also compute Euclidean distance between two vectors as below

r = euclidean(x, y)

Computing distances between corresponding objects ("column-wise")

Suppose you have two m-by-n matrix X and Y, then you can compute all distances between corresponding columns of X and Y in one batch, using the colwise function, as

r = colwise(dist, X, Y)

The output r is a vector of length n. In particular, r[i] is the distance between X[:,i] and Y[:,i]. The batch computation typically runs considerably faster than calling evaluate column-by-column.

Note that either of X and Y can be just a single vector -- then the colwise function computes the distance between this vector and each column of the other argument.

Computing pairwise distances

Let X and Y have m and n columns, respectively, and the same number of rows. Then the pairwise function with the dims=2 argument computes distances between each pair of columns in X and Y:

R = pairwise(dist, X, Y, dims=2)

In the output, R is a matrix of size (m, n), such that R[i,j] is the distance between X[:,i] and Y[:,j]. Computing distances for all pairs using pairwise function is often remarkably faster than evaluting for each pair individually.

If you just want to just compute distances between all columns of a matrix X, you can write

R = pairwise(dist, X, dims=2)

This statement will result in an m-by-m matrix, where R[i,j] is the distance between X[:,i] and X[:,j]. pairwise(dist, X) is typically more efficient than pairwise(dist, X, X), as the former will take advantage of the symmetry when dist is a semi-metric (including metric).

To compute pairwise distances for matrices with observations stored in rows use the argument dims=1.

Computing column-wise and pairwise distances inplace

If the vector/matrix to store the results are pre-allocated, you may use the storage (without creating a new array) using the following syntax (i being either 1 or 2):

colwise!(r, dist, X, Y)
pairwise!(R, dist, X, Y, dims=i)
pairwise!(R, dist, X, dims=i)

Please pay attention to the difference, the functions for inplace computation are colwise! and pairwise! (instead of colwise and pairwise).

Distance type hierarchy

The distances are organized into a type hierarchy.

At the top of this hierarchy is an abstract class PreMetric, which is defined to be a function d that satisfies

d(x, x) == 0  for all x
d(x, y) >= 0  for all x, y

SemiMetric is a abstract type that refines PreMetric. Formally, a semi-metric is a pre-metric that is also symmetric, as

d(x, y) == d(y, x)  for all x, y

Metric is a abstract type that further refines SemiMetric. Formally, a metric is a semi-metric that also satisfies triangle inequality, as

d(x, z) <= d(x, y) + d(y, z)  for all x, y, z

This type system has practical significance. For example, when computing pairwise distances between a set of vectors, you may only perform computation for half of the pairs, derive the values immediately for the remaining half by leveraging the symmetry of semi-metrics. Note that the types of SemiMetric and Metric do not completely follow the definition in mathematics as they do not require the "distance" to be able to distinguish between points: for these types x != y does not imply that d(x, y) != 0 in general compared to the mathematical definition of semi-metric and metric, as this property does not change computations in practice.

Each distance corresponds to a distance type. The type name and the corresponding mathematical definitions of the distances are listed in the following table.

type name convenient syntax math definition
Euclidean euclidean(x, y) sqrt(sum((x - y) .^ 2))
SqEuclidean sqeuclidean(x, y) sum((x - y).^2)
PeriodicEuclidean peuclidean(x, y, w) sqrt(sum(min(mod(abs(x - y), w), w - mod(abs(x - y), w)).^2))
Cityblock cityblock(x, y) sum(abs(x - y))
TotalVariation totalvariation(x, y) sum(abs(x - y)) / 2
Chebyshev chebyshev(x, y) max(abs(x - y))
Minkowski minkowski(x, y, p) sum(abs(x - y).^p) ^ (1/p)
Hamming hamming(k, l) sum(k .!= l)
RogersTanimoto rogerstanimoto(a, b) 2(sum(a&!b) + sum(!a&b)) / (2(sum(a&!b) + sum(!a&b)) + sum(a&b) + sum(!a&!b))
Jaccard jaccard(x, y) 1 - sum(min(x, y)) / sum(max(x, y))
BrayCurtis braycurtis(x, y) sum(abs(x - y)) / sum(abs(x + y))
CosineDist cosine_dist(x, y) 1 - dot(x, y) / (norm(x) * norm(y))
CorrDist corr_dist(x, y) cosine_dist(x - mean(x), y - mean(y))
ChiSqDist chisq_dist(x, y) sum((x - y).^2 / (x + y))
KLDivergence kl_divergence(p, q) sum(p .* log(p ./ q))
GenKLDivergence gkl_divergence(x, y) sum(p .* log(p ./ q) - p + q)
RenyiDivergence renyi_divergence(p, q, k) log(sum( p .* (p ./ q) .^ (k - 1))) / (k - 1)
JSDivergence js_divergence(p, q) KL(p, m) / 2 + KL(q, m) / 2 with m = (p + q) / 2
SpanNormDist spannorm_dist(x, y) max(x - y) - min(x - y)
BhattacharyyaDist bhattacharyya(x, y) -log(sum(sqrt(x .* y) / sqrt(sum(x) * sum(y)))
HellingerDist hellinger(x, y) sqrt(1 - sum(sqrt(x .* y) / sqrt(sum(x) * sum(y))))
Haversine haversine(x, y, r = 6_371_000) Haversine formula
SphericalAngle spherical_angle(x, y) Haversine formula
Mahalanobis mahalanobis(x, y, Q) sqrt((x - y)' * Q * (x - y))
SqMahalanobis sqmahalanobis(x, y, Q) (x - y)' * Q * (x - y)
MeanAbsDeviation meanad(x, y) mean(abs.(x - y))
MeanSqDeviation msd(x, y) mean(abs2.(x - y))
RMSDeviation rmsd(x, y) sqrt(msd(x, y))
NormRMSDeviation nrmsd(x, y) rmsd(x, y) / (maximum(x) - minimum(x))
WeightedEuclidean weuclidean(x, y, w) sqrt(sum((x - y).^2 .* w))
WeightedSqEuclidean wsqeuclidean(x, y, w) sum((x - y).^2 .* w)
WeightedCityblock wcityblock(x, y, w) sum(abs(x - y) .* w)
WeightedMinkowski wminkowski(x, y, w, p) sum(abs(x - y).^p .* w) ^ (1/p)
WeightedHamming whamming(x, y, w) sum((x .!= y) .* w)
Bregman bregman(F, โˆ‡, x, y; inner=dot) F(x) - F(y) - inner(โˆ‡(y), x - y)

Note: The formulas above are using Julia's functions. These formulas are mainly for conveying the math concepts in a concise way. The actual implementation may use a faster way. The arguments x and y are iterable objects, typically arrays of real numbers; w is an iterator/array of parameters (like weights or periods); k and l are iterators/arrays of distinct elements of any kind; a and b are iterators/arrays of Bools; and finally, p and q are iterators/arrays forming a discrete probability distribution and are therefore both expected to sum to one.

Precision for Euclidean and SqEuclidean

For efficiency (see the benchmarks below), Euclidean and SqEuclidean make use of BLAS3 matrix-matrix multiplication to calculate distances. This corresponds to the following expansion:

(x-y)^2 == x^2 - 2xy + y^2

However, equality is not precise in the presence of roundoff error, and particularly when x and y are nearby points this may not be accurate. Consequently, Euclidean and SqEuclidean allow you to supply a relative tolerance to force recalculation:

julia> x = reshape([0.1, 0.3, -0.1], 3, 1);

julia> pairwise(Euclidean(), x, x)
1ร—1 Array{Float64,2}:
 7.45058e-9

julia> pairwise(Euclidean(1e-12), x, x)
1ร—1 Array{Float64,2}:
 0.0

Benchmarks

The implementation has been carefully optimized based on benchmarks. The script in benchmark/benchmarks.jl defines a benchmark suite for a variety of distances, under column-wise and pairwise settings.

Here are benchmarks obtained running Julia 1.5 on a computer with a quad-core Intel Core i5-2300K processor @ 3.2 GHz. Extended versions of the tables below can be replicated using the script in benchmark/print_table.jl.

Column-wise benchmark

Generically, column-wise distances are computed using a straightforward loop implementation. For [Sq]Mahalanobis, however, specialized methods are provided in Distances.jl, and the table below compares the performance (measured in terms of average elapsed time of each iteration) of the generic to the specialized implementation. The task in each iteration is to compute a specific distance between corresponding columns in two 200-by-10000 matrices.

distance loop colwise gain
SqMahalanobis 0.089470s 0.014424s 6.2027
Mahalanobis 0.090882s 0.014096s 6.4475

Pairwise benchmark

Generically, pairwise distances are computed using a straightforward loop implementation. For distances of which a major part of the computation is a quadratic form, however, the performance can be drastically improved by restructuring the computation and delegating the core part to GEMM in BLAS. The table below compares the performance (measured in terms of average elapsed time of each iteration) of generic to the specialized implementations provided in Distances.jl. The task in each iteration is to compute a specific distance in a pairwise manner between columns in a 100-by-200 and 100-by-250 matrices, which will result in a 200-by-250 distance matrix.

distance loop pairwise gain
SqEuclidean 0.001273s 0.000124s 10.2290
Euclidean 0.001445s 0.000194s 7.4529
CosineDist 0.001928s 0.000149s 12.9543
CorrDist 0.016837s 0.000187s 90.1854
WeightedSqEuclidean 0.001603s 0.000143s 11.2119
WeightedEuclidean 0.001811s 0.000238s 7.6032
SqMahalanobis 0.308990s 0.000248s 1248.1892
Mahalanobis 0.313415s 0.000346s 906.1836

More Repositories

1

Distributions.jl

A Julia package for probability distributions and associated functions.
Julia
1,073
star
2

GLM.jl

Generalized linear models in Julia
Julia
574
star
3

StatsBase.jl

Basic statistics for Julia
Julia
565
star
4

MixedModels.jl

A Julia package for fitting (statistical) mixed-effects models
Julia
395
star
5

MultivariateStats.jl

A Julia package for multivariate statistics and data analysis (e.g. dimension reduction)
Julia
370
star
6

Clustering.jl

A Julia package for data clustering
Julia
343
star
7

TimeSeries.jl

Time series toolkit for Julia
Julia
342
star
8

HypothesisTests.jl

Hypothesis tests for Julia
Julia
287
star
9

StatsModels.jl

Specifying, fitting, and evaluating statistical models in Julia
Julia
245
star
10

StatsFuns.jl

Mathematical functions related to statistics.
Julia
224
star
11

MLBase.jl

A set of functions to support the development of machine learning algorithms
Julia
186
star
12

KernelDensity.jl

Kernel density estimators for Julia
Julia
169
star
13

Klara.jl

MCMC inference in Julia
Julia
168
star
14

RDatasets.jl

Julia package for loading many of the data sets available in R
R
156
star
15

Lasso.jl

Lasso/Elastic Net linear and generalized linear models
Julia
141
star
16

StatsKit.jl

Convenience meta-package to load essential packages for statistics
Julia
136
star
17

PDMats.jl

Uniform Interface for positive definite matrices of various structures
Julia
103
star
18

GLMNet.jl

Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet
Julia
96
star
19

Loess.jl

Local regression, so smooooth!
Julia
95
star
20

NMF.jl

A Julia package for non-negative matrix factorization
Julia
89
star
21

LogExpFunctions.jl

Julia package for various special functions based on `log` and `exp`.
Julia
73
star
22

Survival.jl

Survival analysis in Julia
Julia
69
star
23

Statistics.jl

The Statistics stdlib that ships with Julia.
Julia
64
star
24

TimeModels.jl

Modeling time series in Julia
Julia
57
star
25

DataArrays.jl

DEPRECATED: Data structures that allow missing values
Julia
53
star
26

PGM.jl

A Julia framework for probabilistic graphical models.
Julia
52
star
27

ConjugatePriors.jl

A Julia package to support conjugate prior distributions.
Julia
46
star
28

SVM.jl

SVM's for Julia
Julia
41
star
29

Roadmap.jl

A centralized location for planning the direction of JuliaStats
35
star
30

NullableArrays.jl

DEPRECATED Prototype of the new JuliaStats NullableArrays package
Julia
35
star
31

Distance.jl

Julia module for Distance evaluation
Julia
27
star
32

DimensionalityReduction.jl

Deprecated in favor of MultivariateStats.jl
Julia
27
star
33

Rmath-julia

The Rmath library from R
C
25
star
34

RegERMs.jl

DEPRECATED: Regularised Empirical Risk Minimisation Framework (SVMs, LogReg, Linear Regression) in Julia
Julia
17
star
35

StatsAPI.jl

A statistics-focused namespace for packages to share functions
Julia
17
star
36

Rmath.jl

Archive of functions that emulate R's d-p-q-r functions for probability distributions
Julia
16
star
37

JuliaStats.github.io

The home page of JuliaStats
CSS
11
star
38

NullableStats.jl

DEPRECATED: Statistical functionality for NullableArrays
Julia
4
star
39

RmathBuilder

Builder repository for rmath-julia
Julia
3
star
40

RmathDist.jl

Julia interface to Rmath distribution functions
Julia
3
star