CRC
Fastest CRC32 for x86, Intel and AMD, + comprehensive derivation and discussion of various approaches
RGB2Y
Fastest CPU (AVX/SSE) RGB to grayscale: 2-4x faster than OpenCV. For image processing/computer vision.
KFAST
Implementation of FAST feature detector for computer vision (Rosten 2006) using AVX2 to outperform canonical implementation by up to 600%.
SortingNetworks
Fastest CPU SIMD (SSE4) sorting networks for small integer arrays (2-6 elements), also optimal amd64 assembly and notes on getting compilers to generate optimal sorting networks.
KORAL
Novel extreme-performance CPU-GPU cooperative feature detector-descriptor for computer vision.
FastArrayOps
Extremely fast x86 / AVX2 assembly implementations of common operations for linear arrays: checking whether array contains element, finding index of element, finding min/max element, finding index of min/max element.
LATCH
Fastest CPU implementation of the LATCH 512-bit binary feature descriptor; fully scale- and rotation-invariant
CLATCH
Insanely fast CUDA LATCH: fully scale- and rotation-invariant 512-bit binary descriptor for computer vision
CUDAKfNN
Fastest CUDA SIFT or other 128-float vector matcher for computer vision
KPS
Infrastructure for simultaneous orbital and attitude propagation, with attitude-based real-time analytical aerodynamics simulation
FastDivide
Divide 64-bit integers faster than hardware. Or precompute for a given denom and quickly divide repeatedly.
KLERP
Fastest CPU (AVX2) Bilinear and Nearest-Neighbor Interpolation: 25-100% faster than OpenCV. For computer vision / image processing.
CUDAK2NN
Insanely fast CUDA 2NN 512-bit binary descriptor matcher for computer vision
CUDARGB2Y
Fastest CUDA RGB to grayscale: 5-30x faster than OpenCV. For image processing/computer vision.
KNES
Complete, lightweight NES emulator in C++, speedcoded in 3 days.
KfNN
Fastest CPU (AVX/SSE) SIFT or other 128-float vector matcher for computer vision
CUDALERP
Fast CUDA (GPU) Bilinear and Nearest-Neighbor Interpolation at high accuracy - uint8_t data
BoxBlur
Fastest CPU (AVX/SSE) Horizontal Box Blur for image processing and computer vision
K2NN
Fast bruteforce and Multi-Index Hash (MIH) accelerated 2NN matchers for 512-bit binary descriptors for computer vision
CUDAHammingMean
Fastest GPU implementation of a brute-force Hamming-weight matrix sum/mean for 512-bit binary descriptors.
ULATCH
Fastest CPU implementation of the LATCH 512-bit binary feature descriptor for computer vision (upright)
CUDAFLERP
Fast CUDA (GPU) Bilinear and Nearest-Neighbor Interpolation at high accuracy - float32 data
FastThreadPool
Fast lock-free thread pool
UCLATCH
Insanely fast CUDA LATCH 512-bit binary descriptor for computer vision (upright)
FastIntegerSqrt
Fastest implementations of 32-bit and 64-bit integer square roots for x86-64
FeatureAngle
Extremely fast SSE gradient (angle of rotation) computation of grayscale features in an image, for image processing and computer vision.
popcount
Fastest possible x86 implementation of popcount/population count/Hamming weight/counting set bits
BitOps
Basic, efficient, header-only bit ops and bit array primitives for modern x86. Tests provided.
MATLAB-KDrag
Orbital and attitude propagator with B-dot and *dynamic* aerodynamic drag simulation, including torque computation for aero-stabilized bodies.
CUDAKfNN_packed
Fastest CUDA SIFT or other 128-float *packed as uint8_t* vector matcher for computer vision
EllipticCurveFactorization
Fast, single-file, MIT-licensed large integer factorization using ECM combined with other techniques.
PyCruiseControl
Modified divorced PID controller applied to car cruise control and accompanying physics simulation and visualizations
ArduinoPhysics
Realtime 2D physics and collision detection on an Arduino with 60 fps output to a Sharp memory LCD.
MemoryOrder
Demos of 3 ways even the strong memory model of x86 can exhibit architectural memory reordering, leading to bugs
PrimeSieve
Super fast, dynamically expanding prime sieve for primality queries, forward or backward iteration
ModularSqrt
Fast modular square root of primes and prime powers, including 2. Interface uses GMP bigints.
KFAST_OpenMVG
Custom version of KFAST for integration into OpenMVG
smart_tm
A smart, leap-second- and leap-day-aware, fast, 64-bit-capable replacement for the ctime 'tm' struct
KHALF
Optimized special-case bilinear interpolation, halving the width and not changing the height, for computer vision dual-frame display.
MATLABCruiseControl
Modified divorced PID controller applied to car cruise control and accompanying physics simulation and visualizations - MATLAB port
Factorization-Primality
Extremely fast, single-file factorization and primality testing for 32-bit and 64-bit integers on x86.
SMC-Demo
Minimal demo of self-modifying code on Windows. Still doable, still useful.
UnsignedIntegralToFloatingPoint
Notes on fast standards-compliant conversion of U32/U64 to and from float/double, which compilers do not get right.
SingleLinePythonSudoku
Single-line Python Sudoku solver
Boids_SDL
Numerical simulation of flocking behavior using pure CPU and SDL.
Sudoku
Fast sudoku solver with detection of no solution/single solution/multiple solutions/invalid initial board
SolveModularQuadratic
Generate all solutions to a modular quadratic equation. Supports any modulus. Interface uses GMP bigints.
CudaBoids
Numerical simulation of flocking behavior using CUDA and OpenGL
Schematic
Basic toy Lisp interpreter in a few hundred lines of C++.
Leftpack
Fast AVX2 leftpack/compress implementations (keep and contiguously pack a subset of elements)
U128
Fast unsigned 128-bit integer class for MSVC since it doesn't natively support __uint128_t yet
FastDivide128
Getting __udivti3 or __umodti3 errors? Just want faster division/modulo for 128-bit ints on Clang? Look no further.