  • Language
    Assembly
  • Created almost 7 years ago
  • Updated about 4 years ago

Repository Details

Optimized functions for Go using SIMD

go-simd

Make certain Go functions faster with SIMD, loop unrolling, c2goasm, or other optimization techniques.

This package chooses the most appropriate implementation at runtime, based on the host CPU features. It is possible to disable certain implementations using the INTEL_DISABLE_EXT environment variable. See the cpu package README for a description of this environment variable.
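
The runtime selection described above can be sketched roughly as follows. This is only an illustration of the dispatch pattern, not the package's actual code: `cpuFeatures`, `selectSum`, `sumAVX2`, and `sumSSE4` are hypothetical names, and the "SIMD" variants simply call the portable loop so that the sketch runs on any machine. Handling of INTEL_DISABLE_EXT is left out because its format is documented in the cpu package README, not here.

```go
package main

import "fmt"

// cpuFeatures is a hypothetical stand-in for whatever the real cpu
// package reports about the host.
type cpuFeatures struct {
	HasAVX2 bool
	HasSSE4 bool
}

// sumGeneric is the portable fallback written in plain Go.
func sumGeneric(v []float64) float64 {
	var s float64
	for _, x := range v {
		s += x
	}
	return s
}

// sumAVX2 and sumSSE4 stand in for assembly-backed routines.
// In this sketch they just delegate to the fallback.
func sumAVX2(v []float64) float64 { return sumGeneric(v) }
func sumSSE4(v []float64) float64 { return sumGeneric(v) }

// selectSum picks the widest implementation the given features allow
// and returns its name together with the function.
func selectSum(f cpuFeatures) (string, func([]float64) float64) {
	switch {
	case f.HasAVX2:
		return "avx2", sumAVX2
	case f.HasSSE4:
		return "sse4", sumSSE4
	default:
		return "go", sumGeneric
	}
}

func main() {
	// Pretend the host supports SSE4 but not AVX2.
	name, sum := selectSum(cpuFeatures{HasSSE4: true})
	fmt.Println(name, sum([]float64{1, 2, 3}))
}
```

Choosing the function once, rather than testing CPU flags on every call, keeps the per-call overhead to a single indirect call.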

Benchmarks

SumFloat64

Benchmarks of various sum implementations, aggregating 1000- and 10000-element slices of float64 values.

  • Intrinsics uses handwritten AVX intrinsics via clang
  • AVX2 uses plain C code, exploiting auto-vectorization and AVX2 architecture enabled via clang
  • SSE4 uses plain C code, exploiting auto-vectorization and SSE4 architecture enabled via clang
  • Go is an equivalent loop in Go
    • Unroll4 and Unroll8 are unrolled versions
BenchmarkSumFloat64_1000-8                   20000000          59 ns/op    134057.61 MB/s
BenchmarkSumFloat64_10000-8                   2000000         842 ns/op     94949.30 MB/s
BenchmarkSumFloat64_Intrinsics_1000-8         5000000         245 ns/op     32550.11 MB/s
BenchmarkSumFloat64_Intrinsics_10000-8         500000        2913 ns/op     27460.17 MB/s
BenchmarkSumFloat64_AVX2_1000-8              30000000          56 ns/op    142336.45 MB/s
BenchmarkSumFloat64_AVX2_10000-8              2000000         847 ns/op     94426.99 MB/s
BenchmarkSumFloat64_SSE4_1000-8               5000000         277 ns/op     28806.44 MB/s
BenchmarkSumFloat64_SSE4_10000-8               500000        2903 ns/op     27556.33 MB/s
BenchmarkSumFloat64_Go_1000-8                 1000000        1124 ns/op      7116.81 MB/s
BenchmarkSumFloat64_Go_10000-8                 200000       11583 ns/op      6906.38 MB/s
BenchmarkSumFloat64_GoUnroll4_1000-8          5000000         287 ns/op     27790.03 MB/s
BenchmarkSumFloat64_GoUnroll4_10000-8          500000        2896 ns/op     27616.44 MB/s
BenchmarkSumFloat64_GoUnroll8_1000-8         10000000         188 ns/op     42341.91 MB/s
BenchmarkSumFloat64_GoUnroll8_10000-8          500000        2924 ns/op     27358.12 MB/s
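
For readers unfamiliar with the Go and Unroll4 variants listed above, a minimal sketch of what such loops might look like follows. The names `sumGo` and `sumUnroll4` are assumptions for illustration and are not taken from the repository. The unrolled version keeps four independent accumulators, which shortens the dependency chain between additions. Because floating-point addition is not associative, it can differ from the simple loop in the last bits for general inputs.

```go
package main

import "fmt"

// sumGo is the straightforward loop: one accumulator, one add per element.
func sumGo(v []float64) float64 {
	var s float64
	for _, x := range v {
		s += x
	}
	return s
}

// sumUnroll4 processes four elements per iteration using four
// independent accumulators, then handles any leftover elements.
func sumUnroll4(v []float64) float64 {
	var s0, s1, s2, s3 float64
	i := 0
	for ; i+4 <= len(v); i += 4 {
		s0 += v[i]
		s1 += v[i+1]
		s2 += v[i+2]
		s3 += v[i+3]
	}
	s := (s0 + s1) + (s2 + s3)
	// Tail: fewer than four elements remain.
	for ; i < len(v); i++ {
		s += v[i]
	}
	return s
}

func main() {
	v := []float64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	fmt.Println(sumGo(v), sumUnroll4(v))
}
```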

unicode/utf8.Valid

Provides a fast implementation of utf8.Valid using SSE and AVX2 functions. Credit for these SIMD implementations goes to Daniel Lemire.

Read this post for more information on these SIMD optimized functions.

BenchmarkValid/utf8.Valid/ASCII/100-8          20000000            79 ns/op    1257.68 MB/s
BenchmarkValid/utf8.Valid/ASCII/10000-8          200000          6140 ns/op    1628.48 MB/s
BenchmarkValid/utf8.Valid/ASCII/1000000-8          2000        608369 ns/op    1643.74 MB/s
BenchmarkValid/utf8.Valid/UTF8/100-8           10000000           139 ns/op     724.09 MB/s
BenchmarkValid/utf8.Valid/UTF8/10000-8            50000         32722 ns/op     305.60 MB/s
BenchmarkValid/utf8.Valid/UTF8/1000000-8            500       3953426 ns/op     252.95 MB/s
BenchmarkValid/sse4.Valid/UTF8/100-8           30000000            43 ns/op    2311.65 MB/s
BenchmarkValid/sse4.Valid/UTF8/10000-8           500000          2436 ns/op    4104.65 MB/s
BenchmarkValid/sse4.Valid/UTF8/1000000-8          10000        243250 ns/op    4110.98 MB/s
BenchmarkValid/sse4.Valid/ASCII/100-8          30000000            43 ns/op    2294.62 MB/s
BenchmarkValid/sse4.Valid/ASCII/10000-8          500000          2439 ns/op    4099.68 MB/s
BenchmarkValid/sse4.Valid/ASCII/1000000-8          5000        246138 ns/op    4062.75 MB/s
BenchmarkValid/avx2.Valid/ASCII/100-8          50000000            24 ns/op    4042.96 MB/s
BenchmarkValid/avx2.Valid/ASCII/10000-8         5000000           256 ns/op   39043.62 MB/s
BenchmarkValid/avx2.Valid/ASCII/1000000-8         50000         30786 ns/op   32481.66 MB/s
BenchmarkValid/avx2.Valid/UTF8/100-8           50000000            35 ns/op    2864.81 MB/s
BenchmarkValid/avx2.Valid/UTF8/10000-8          1000000          1440 ns/op    6943.45 MB/s
BenchmarkValid/avx2.Valid/UTF8/1000000-8          10000        142939 ns/op    6995.97 MB/s
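
The benchmarks above distinguish ASCII-only input from input containing multi-byte UTF-8 sequences. The sketch below uses only the standard library's `utf8.Valid` to show the behavior the SIMD versions are meant to reproduce; `classify` is a hypothetical helper written for this example, not part of the package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// classify reports whether b is pure ASCII, valid UTF-8 that contains
// multi-byte sequences, or not valid UTF-8 at all.
func classify(b []byte) string {
	if !utf8.Valid(b) {
		return "invalid"
	}
	for _, c := range b {
		if c >= utf8.RuneSelf { // bytes >= 0x80 only occur in multi-byte sequences
			return "utf8"
		}
	}
	return "ascii"
}

func main() {
	fmt.Println(classify([]byte("hello")))
	fmt.Println(classify([]byte("héllo, 世界")))
	fmt.Println(classify([]byte{0xe4, 0xb8})) // truncated three-byte sequence
}
```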

encoding/ascii.Valid

A fast implementation for determining whether a buffer is valid ASCII data. Credit for the SIMD implementations goes to Daniel Lemire.

BenchmarkValid/go.Valid/100-8         20000000          52 ns/op     1911.59 MB/s
BenchmarkValid/go.Valid/10000-8         500000        3048 ns/op     3280.27 MB/s
BenchmarkValid/go.Valid/1000000-8         5000      303508 ns/op     3294.80 MB/s
BenchmarkValid/sse4.Valid/100-8      100000000          11 ns/op     8674.49 MB/s
BenchmarkValid/sse4.Valid/10000-8      5000000         379 ns/op    26379.43 MB/s
BenchmarkValid/sse4.Valid/1000000-8      50000       37061 ns/op    26982.04 MB/s
BenchmarkValid/avx2.Valid/100-8      200000000           8 ns/op    12437.12 MB/s
BenchmarkValid/avx2.Valid/10000-8     10000000         137 ns/op    72718.12 MB/s
BenchmarkValid/avx2.Valid/1000000-8     100000       17767 ns/op    56280.99 MB/s
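
As a rough illustration of why wider registers help here, the sketch below checks ASCII validity one byte at a time and then eight bytes at a time in plain Go. Both functions are assumptions written for this example (`validASCII` and `validASCIIWord` are hypothetical names) and are not the repository's go.Valid, sse4.Valid, or avx2.Valid. The idea is that a buffer is ASCII exactly when no byte has its high bit set, so bytes can be OR-ed together and tested once at the end.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// validASCII checks each byte individually and stops at the first
// byte outside the 7-bit range.
func validASCII(b []byte) bool {
	for _, c := range b {
		if c >= 0x80 {
			return false
		}
	}
	return true
}

// validASCIIWord ORs the input together eight bytes at a time and
// tests the high bit of every byte lane once at the end.
func validASCIIWord(b []byte) bool {
	const highBits = 0x8080808080808080
	var acc uint64
	for len(b) >= 8 {
		acc |= binary.LittleEndian.Uint64(b)
		b = b[8:]
	}
	// Fewer than eight bytes remain; fold them into a single byte.
	var tail byte
	for _, c := range b {
		tail |= c
	}
	return acc&highBits == 0 && tail < 0x80
}

func main() {
	ok := []byte("plain ASCII text, 0123456789")
	bad := append([]byte("caf"), 0xc3, 0xa9) // "café" encoded as UTF-8
	fmt.Println(validASCII(ok), validASCIIWord(ok))
	fmt.Println(validASCII(bad), validASCIIWord(bad))
}
```

SSE and AVX2 versions apply the same accumulate-then-test idea to 16 or 32 bytes per instruction.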
