Kareem Omar (@komrad36)
  • Stars 885
  • Global Rank 34,457 (Top 2 %)
  • Followers 230
  • Following 4
  • Registered over 9 years ago
  • Most used languages
    C++ 84.0 %
    Assembly 6.0 %
    MATLAB 4.0 %
    C 2.0 %
    Python 2.0 %
    Cuda 2.0 %
  • Location 🇺🇸 United States
  • Country Total Rank 10,177
  • Country Ranking
    Assembly 120
    Cuda 500
    C++ 585
    MATLAB 2,003
    C 2,517

Top repositories

1

CRC

Fastest CRC32 for x86, Intel and AMD, + comprehensive derivation and discussion of various approaches
C++
219
star
2

RGB2Y

Fastest CPU (AVX/SSE) RGB to grayscale: 2-4x faster than OpenCV. For image processing/computer vision.
C++
89
star
3

KFAST

Implementation of FAST feature detector for computer vision (Rosten 2006) using AVX2 to outperform canonical implementation by up to 600%.
C
74
star
4

SortingNetworks

Fastest CPU SIMD (SSE4) sorting networks for small integer arrays (2-6 elements), also optimal amd64 assembly and notes on getting compilers to generate optimal sorting networks.
Assembly
42
star
5

KORAL

Novel extreme-performance CPU-GPU cooperative feature detector-descriptor for computer vision.
C++
38
star
6

FastArrayOps

Extremely fast x86 / AVX2 assembly implementations of common operations for linear arrays: checking whether array contains element, finding index of element, finding min/max element, finding index of min/max element.
Assembly
36
star
7

LATCH

Fastest CPU implementation of the LATCH 512-bit binary feature descriptor; fully scale- and rotation-invariant
C++
34
star
8

CLATCH

Insanely fast CUDA LATCH: fully scale- and rotation-invariant 512-bit binary descriptor for computer vision
C++
32
star
9

CUDAKfNN

Fastest CUDA SIFT or other 128-float vector matcher for computer vision
C++
25
star
10

KPS

Infrastructure for simultaneous orbital and attitude propagation, with attitude-based real-time analytical aerodynamics simulation
C++
23
star
11

FastDivide

Divide 64-bit integers faster than hardware. Or precompute for a given denom and quickly divide repeatedly.
C++
22
star
12

KLERP

Fastest CPU (AVX2) Bilinear and Nearest-Neighbor Interpolation: 25-100% faster than OpenCV. For computer vision / image processing.
C++
19
star
13

CUDAK2NN

Insanely fast CUDA 2NN 512-bit binary descriptor matcher for computer vision
C++
14
star
14

CUDARGB2Y

Fastest CUDA RGB to grayscale: 5-30x faster than OpenCV. For image processing/computer vision.
C++
14
star
15

KNES

Complete, lightweight NES emulator in C++, speedcoded in 3 days.
C++
14
star
16

KfNN

Fastest CPU (AVX/SSE) SIFT or other 128-float vector matcher for computer vision
C++
13
star
17

CUDALERP

Fast CUDA (GPU) Bilinear and Nearest-Neighbor Interpolation at high accuracy - uint8_t data
C++
12
star
18

BoxBlur

Fastest CPU (AVX/SSE) Horizontal Box Blur for image processing and computer vision
C++
10
star
19

K2NN

Fast brute-force and Multi-Index Hash (MIH) accelerated 2NN matchers for 512-bit binary descriptors for computer vision
C++
10
star
20

CUDAHammingMean

Fastest GPU implementation of a brute-force Hamming-weight matrix sum/mean for 512-bit binary descriptors.
C++
9
star
21

ULATCH

Fastest CPU implementation of the LATCH 512-bit binary feature descriptor for computer vision (upright)
C++
9
star
22

CUDAFLERP

Fast CUDA (GPU) Bilinear and Nearest-Neighbor Interpolation at high accuracy - float32 data
C++
9
star
23

FastThreadPool

Fast lock-free thread pool
C++
8
star
24

UCLATCH

Insanely fast CUDA LATCH 512-bit binary descriptor for computer vision (upright)
C++
8
star
25

FastIntegerSqrt

Fastest implementations of 32-bit and 64-bit integer square roots for x86-64
C++
7
star
26

FeatureAngle

Extremely fast SSE gradient (angle of rotation) computation of grayscale features in an image, for image processing and computer vision.
C++
7
star
27

popcount

Fastest possible x86 implementation of popcount/population count/Hamming weight/counting set bits
C++
6
star
28

BitOps

Basic, efficient, header-only bit ops and bit array primitives for modern x86. Tests provided.
C++
6
star
29

MATLAB-KDrag

Orbital and attitude propagator with B-dot and *dynamic* aerodynamic drag simulation, including torque computation for aero-stabilized bodies.
MATLAB
6
star
30

CUDAKfNN_packed

Fastest CUDA SIFT or other 128-float *packed as uint8_t* vector matcher for computer vision
C++
5
star
31

EllipticCurveFactorization

Fast, single-file, MIT-licensed large integer factorization using ECM combined with other techniques.
C++
5
star
32

PyCruiseControl

Modified divorced PID controller applied to car cruise control and accompanying physics simulation and visualizations
Python
5
star
33

ArduinoPhysics

Realtime 2D physics and collision detection on an Arduino with 60 fps output to a Sharp memory LCD.
C++
5
star
34

MemoryOrder

Demos of 3 ways even the strong memory model of x86 can exhibit architectural memory reordering, leading to bugs
C++
5
star
35

PrimeSieve

Super fast, dynamically expanding prime sieve for primality queries, forward or backward iteration
C++
4
star
36

ModularSqrt

Fast modular square root of primes and prime powers, including 2. Interface uses GMP bigints.
C++
4
star
37

KFAST_OpenMVG

Custom version of KFAST for integration into OpenMVG
C++
4
star
38

smart_tm

A smart, leap-second- and leap-day-aware, fast, 64-bit-capable replacement for the ctime 'tm' struct
C++
3
star
39

KHALF

Optimized special-case bilinear interpolation, halving the width and not changing the height, for computer vision dual-frame display.
C++
3
star
40

MATLABCruiseControl

Modified divorced PID controller applied to car cruise control and accompanying physics simulation and visualizations - MATLAB port
MATLAB
3
star
41

Factorization-Primality

Extremely fast, single-file factorization and primality testing for 32-bit and 64-bit integers on x86.
C++
3
star
42

SMC-Demo

Minimal demo of self-modifying code on Windows. Still doable, still useful.
Assembly
3
star
43

UnsignedIntegralToFloatingPoint

Notes on fast standards-compliant conversion of U32/U64 to and from float/double, which compilers do not get right.
3
star
44

SingleLinePythonSudoku

Single-line Python Sudoku solver
2
star
45

Boids_SDL

Numerical simulation of flocking behavior using pure CPU and SDL.
C++
2
star
46

Sudoku

Fast sudoku solver with detection of no solution/single solution/multiple solutions/invalid initial board
C++
2
star
47

SolveModularQuadratic

Generate all solutions to a modular quadratic equation. Supports any modulus. Interface uses GMP bigints.
C++
2
star
48

CudaBoids

Numerical simulation of flocking behavior using CUDA and OpenGL
Cuda
2
star
49

Schematic

Basic toy Lisp interpreter in a few hundred lines of C++.
C++
2
star
50

Leftpack

Fast AVX2 leftpack/compress implementations (keep and contiguously pack a subset of elements)
C++
1
star
51

U128

Fast unsigned 128-bit integer class for MSVC since it doesn't natively support __uint128_t yet
C++
1
star
52

FastDivide128

Getting __udivti3 or __umodti3 errors? Just want faster division/modulo for 128-bit ints on Clang? Look no further.
C++
1
star