libpopcnt
libpopcnt.h
is a header-only C/C++ library for counting the
number of 1 bits (bit population count) in an array as quickly as
possible using specialized CPU instructions i.e.
POPCNT,
AVX2,
AVX512,
NEON.
libpopcnt.h
has been tested successfully using the GCC,
Clang and MSVC compilers.
The algorithms used in libpopcnt.h
are described in the paper
Faster Population Counts using AVX2 Instructions
by Daniel Lemire, Nathan Kurz and Wojciech Mula (23 Nov 2016).
How it works
On x86 CPUs libpopcnt.h
uses a combination of 4 different bit
population count algorithms:
- For array sizes < 512 bytes a
POPCNT
algorithm is used. - For array sizes ≥ 512 bytes an
AVX2
algorithm is used. - For array sizes ≥ 1024 bytes an
AVX512
algorithm is used. - For CPUs without
POPCNT
instruction a portable integer algorithm is used.
Note that libpopcnt.h
works on all CPUs, it checks at run-time
whether your CPU supports POPCNT, AVX2, AVX512 before using it
and it is also thread-safe.
C/C++ API
#include "libpopcnt.h"
/*
* Count the number of 1 bits in the data array
* @data: An array
* @size: Size of data in bytes
*/
uint64_t popcnt(const void* data, uint64_t size);
Speedup
This benchmark shows the speedup of the 4 popcount algorithms used on x86 CPUs compared to the basic lookup-8 popcount algorithm for different array sizes (in bytes).
Algorithm | 32 B | 64 B | 128 B | 256 B | 512 B | 1024 B | 2048 B | 4096 B |
lookup-8 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
bit-parallel-mul | 1.41 | 1.54 | 1.63 | 1.78 | 1.60 | 1.62 | 1.63 | 1.64 |
builtin-popcnt | 4.75 | 6.36 | 8.58 | 8.55 | 6.72 | 7.60 | 7.88 | 7.94 |
avx2-harley-seal | 1.15 | 1.85 | 3.22 | 4.17 | 8.46 | 10.74 | 12.52 | 13.66 |
avx512-harley-seal | 0.35 | 1.49 | 2.54 | 3.83 | 5.63 | 15.12 | 22.18 | 25.60 |
libpopcnt.h
automatically picks the fastest algorithm for
the given array size. This benchmark was run on an Intel Xeon
Platinum 8168 CPU with GCC 5.4.
CPU architectures
libpopcnt.h
has hardware accelerated popcount algorithms for
the following CPU architectures:
x86 | POPCNT , AVX2 , AVX512 |
x86-64 | POPCNT , AVX2 , AVX512 |
ARM | NEON |
PPC64 | POPCNTD |
For other CPU architectures a fast integer popcount algorithm is used.
How to compile
libpopcnt.h
does not require any special compiler flags like -mavx2
!
In order to get the best performance we recommend however to compile with
optimizations enabled e.g. -O3
or -O2
.
cc -O3 program.c
c++ -O3 program.cpp
Development
cmake .
make -j
make test
The above commands also build the benchmark
program which is
useful for benchmarking libpopcnt.h
. Below is a
usage example run on an Intel Xeon Platinum 8168 CPU from 2017:
# Usage: ./benchmark [array bytes] [iters]
./benchmark
Iters: 10000000
Array size: 16.00 KB
Algorithm: AVX512
Status: 100%
Seconds: 1.59
103.4 GB/s
Acknowledgments
The vectorized popcount algorithms used in libpopcnt.h
have
originally been written by Wojciech Muła,
I just made a convenient and portable C/C++ library using these algorithms.