Discover @komrad36 Open Source projects

Kareem Omar (@komrad36)

Stars 885
Global Rank 34,458 (Top 2 %)
Followers 230
Following 4
Registered over 9 years ago
Most used languages

C++ 84.0 %
Assembly 6.0 %
MATLAB 4.0 %
Python 2.0 %
C 2.0 %
Cuda 2.0 %
Location 🇺🇸 United States
Country Total Rank 10,177
Country Ranking

Assembly 120
Cuda 418
C++ 585
MATLAB 1,950
C 2,511

CRC

Fastest CRC32 for x86 (Intel and AMD), plus a comprehensive derivation and discussion of various approaches.

RGB2Y

Fastest CPU (AVX/SSE) RGB to grayscale: 2-4x faster than OpenCV. For image processing/computer vision.

KFAST

Implementation of the FAST feature detector for computer vision (Rosten 2006) using AVX2 to outperform the canonical implementation by up to 600%.

SortingNetworks

Fastest CPU SIMD (SSE4) sorting networks for small integer arrays (2-6 elements), plus optimal amd64 assembly and notes on getting compilers to generate optimal sorting networks.

KORAL

Novel extreme-performance CPU-GPU cooperative feature detector-descriptor for computer vision.

FastArrayOps

Extremely fast x86 / AVX2 assembly implementations of common operations for linear arrays: checking whether array contains element, finding index of element, finding min/max element, finding index of min/max element.

LATCH

Fastest CPU implementation of the LATCH 512-bit binary feature descriptor; fully scale- and rotation-invariant

CLATCH

Insanely fast CUDA LATCH: fully scale- and rotation-invariant 512-bit binary descriptor for computer vision

CUDAKfNN

Fastest CUDA SIFT or other 128-float vector matcher for computer vision

KPS

Infrastructure for simultaneous orbital and attitude propagation, with attitude-based real-time analytical aerodynamics simulation

FastDivide

Divide 64-bit integers faster than hardware, or precompute for a given denominator and quickly divide repeatedly.

KLERP

Fastest CPU (AVX2) Bilinear and Nearest-Neighbor Interpolation: 25-100% faster than OpenCV. For computer vision / image processing.

CUDAK2NN

Insanely fast CUDA 2NN 512-bit binary descriptor matcher for computer vision

CUDARGB2Y

Fastest CUDA RGB to grayscale: 5-30x faster than OpenCV. For image processing/computer vision.

KNES

Complete, lightweight NES emulator in C++, speedcoded in 3 days.

KfNN

Fastest CPU (AVX/SSE) SIFT or other 128-float vector matcher for computer vision

CUDALERP

Fast CUDA (GPU) Bilinear and Nearest-Neighbor Interpolation at high accuracy - uint8_t data

BoxBlur

Fastest CPU (AVX/SSE) Horizontal Box Blur for image processing and computer vision

K2NN

Fast bruteforce and Multi-Index Hash (MIH) accelerated 2NN matchers for 512-bit binary descriptors for computer vision

CUDAHammingMean

Fastest GPU implementation of a brute-force Hamming-weight matrix sum/mean for 512-bit binary descriptors.

ULATCH

Fastest CPU implementation of the LATCH 512-bit binary feature descriptor for computer vision (upright)

CUDAFLERP

Fast CUDA (GPU) Bilinear and Nearest-Neighbor Interpolation at high accuracy - float32 data

FastThreadPool

Fast lock-free thread pool

UCLATCH

Insanely fast CUDA LATCH 512-bit binary descriptor for computer vision (upright)

FastIntegerSqrt

Fastest implementations of 32-bit and 64-bit integer square roots for x86-64

FeatureAngle

Extremely fast SSE gradient (angle of rotation) computation of grayscale features in an image, for image processing and computer vision.

popcount

Fastest possible x86 implementation of popcount/population count/Hamming weight/counting set bits

BitOps

Basic, efficient, header-only bit ops and bit array primitives for modern x86. Tests provided.

MATLAB-KDrag

Orbital and attitude propagator with B-dot and *dynamic* aerodynamic drag simulation, including torque computation for aero-stabilized bodies.

CUDAKfNN_packed

Fastest CUDA SIFT or other 128-float *packed as uint8_t* vector matcher for computer vision

EllipticCurveFactorization

Fast, single-file, MIT-licensed large integer factorization using ECM combined with other techniques.

PyCruiseControl

Modified divorced PID controller applied to car cruise control and accompanying physics simulation and visualizations

ArduinoPhysics

Realtime 2D physics and collision detection on an Arduino with 60 fps output to a Sharp memory LCD.

MemoryOrder

Demos of 3 ways even the strong memory model of x86 can exhibit architectural memory reordering, leading to bugs

PrimeSieve

Super fast, dynamically expanding prime sieve for primality queries, forward or backward iteration

ModularSqrt

Fast square roots modulo primes and prime powers, including 2. Interface uses GMP bigints.

KFAST_OpenMVG

Custom version of KFAST for integration into OpenMVG

smart_tm

A smart, fast, 64-bit-capable replacement for the ctime 'tm' struct that is aware of leap seconds and leap days.

KHALF

Optimized special-case bilinear interpolation, halving the width and not changing the height, for computer vision dual-frame display.

MATLABCruiseControl

Modified divorced PID controller applied to car cruise control and accompanying physics simulation and visualizations - MATLAB port

Factorization-Primality

Extremely fast, single-file factorization and primality testing for 32-bit and 64-bit integers on x86.

SMC-Demo

Minimal demo of self-modifying code on Windows. Still doable, still useful.

UnsignedIntegralToFloatingPoint

Notes on fast standards-compliant conversion of U32/U64 to and from float/double, which compilers do not get right.

SingleLinePythonSudoku

Single-line Python Sudoku solver

Boids_SDL

Numerical simulation of flocking behavior using pure CPU and SDL.

Sudoku

Fast sudoku solver with detection of no solution/single solution/multiple solutions/invalid initial board

SolveModularQuadratic

Generate all solutions to a modular quadratic equation. Supports any modulus. Interface uses GMP bigints.

CudaBoids

Numerical simulation of flocking behavior using CUDA and OpenGL

Schematic

Basic toy Lisp interpreter in a few hundred lines of C++.

Leftpack

Fast AVX2 leftpack/compress implementations (keep and contiguously pack a subset of elements)

U128

Fast unsigned 128-bit integer class for MSVC, which does not yet natively support __uint128_t.

FastDivide128

Getting __udivti3 or __umodti3 errors? Just want faster division/modulo for 128-bit ints on Clang? Look no further.