  • Stars: 206
  • Rank: 190,504 (Top 4 %)
  • Language: Rust
  • License: Apache License 2.0
  • Created over 8 years ago
  • Updated 4 months ago

Repository Details

General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.

matrixmultiply

General matrix multiplication for f32, f64, and complex matrices. Operates on matrices with general layout (they can use arbitrary row and column stride).
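To make "general layout" concrete, here is a minimal reference sketch of the operation C ← αAB + βC over strided matrices. It is not the crate's implementation or its API: the function name `gemm_ref` is hypothetical, and it uses slices with `usize` strides so that the example stays in safe, standard-library-only Rust. Element (i, j) of each matrix is assumed to live at index `i * row_stride + j * col_stride`.

```rust
/// Naive reference GEMM: C <- alpha * A * B + beta * C.
/// A is m x k, B is k x n, C is m x n. Element (i, j) of a matrix is read
/// from `data[i * row_stride + j * col_stride]`.
/// `gemm_ref` is a hypothetical name used only for this illustration.
fn gemm_ref(
    m: usize, k: usize, n: usize,
    alpha: f32,
    a: &[f32], rsa: usize, csa: usize,
    b: &[f32], rsb: usize, csb: usize,
    beta: f32,
    c: &mut [f32], rsc: usize, csc: usize,
) {
    for i in 0..m {
        for j in 0..n {
            // Dot product of row i of A with column j of B.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * rsa + p * csa] * b[p * rsb + j * csb];
            }
            let idx = i * rsc + j * csc;
            c[idx] = alpha * acc + beta * c[idx];
        }
    }
}

fn main() {
    // A is 2x3, row major: row stride 3, column stride 1.
    let a: [f32; 6] = [1.0, 2.0, 3.0,
                       4.0, 5.0, 6.0];
    // B is 3x2, column major: row stride 1, column stride 3.
    let b: [f32; 6] = [1.0, 0.0, 1.0,  // column 0
                       0.0, 1.0, 1.0]; // column 1
    // C is 2x2, row major.
    let mut c = [0.0f32; 4];
    gemm_ref(2, 3, 2, 1.0, &a, 3, 1, &b, 1, 3, 0.0, &mut c, 2, 1);
    println!("{:?}", c); // prints [4.0, 5.0, 10.0, 11.0]
}
```

Because only the strides change, the same routine handles row major, column major, and transposed views without copying the inputs.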

Please read the API documentation here

We presently provide a few good microkernels, portable and for x86-64 and AArch64 NEON, and only one operation: the general matrix-matrix multiplication (“gemm”).

This crate was inspired by the macro/microkernel approach to matrix multiplication that is used by the BLIS project.
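The macro/microkernel idea can be sketched roughly as follows. This is a simplified illustration, not the crate's code: the tile sizes `MR` and `NR`, the function names, and the restriction to row-major inputs are all assumptions made for the example. It also omits cache blocking and repacks B for every row panel, which a tuned implementation would avoid.

```rust
const MR: usize = 4; // microkernel tile rows (illustrative value)
const NR: usize = 4; // microkernel tile columns (illustrative value)

/// Microkernel: computes one MR x NR tile from packed panels.
/// `a_panel` holds k groups of MR values, `b_panel` holds k groups of NR values.
fn microkernel(k: usize, a_panel: &[f32], b_panel: &[f32]) -> [[f32; NR]; MR] {
    let mut acc = [[0.0f32; NR]; MR];
    for p in 0..k {
        let a = &a_panel[p * MR..(p + 1) * MR];
        let b = &b_panel[p * NR..(p + 1) * NR];
        // Rank-1 update of the tile with one column of A and one row of B.
        for i in 0..MR {
            for j in 0..NR {
                acc[i][j] += a[i] * b[j];
            }
        }
    }
    acc
}

/// Macro loop: C <- A * B for row-major A (m x k), B (k x n), C (m x n).
fn gemm_blocked(m: usize, k: usize, n: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    let mut a_panel = vec![0.0f32; k * MR];
    let mut b_panel = vec![0.0f32; k * NR];
    for i0 in (0..m).step_by(MR) {
        let rows = MR.min(m - i0);
        // Pack MR rows of A contiguously; rows past the edge are zero padded.
        for p in 0..k {
            for i in 0..MR {
                a_panel[p * MR + i] = if i < rows { a[(i0 + i) * k + p] } else { 0.0 };
            }
        }
        for j0 in (0..n).step_by(NR) {
            let cols = NR.min(n - j0);
            // Pack NR columns of B contiguously, zero padded at the edge.
            for p in 0..k {
                for j in 0..NR {
                    b_panel[p * NR + j] = if j < cols { b[p * n + j0 + j] } else { 0.0 };
                }
            }
            let tile = microkernel(k, &a_panel, &b_panel);
            // Masked write-back: store only the part of the tile inside C.
            for i in 0..rows {
                for j in 0..cols {
                    c[(i0 + i) * n + j0 + j] = tile[i][j];
                }
            }
        }
    }
}

fn main() {
    // 2x2 times 2x2, both row major.
    let a = [1.0f32, 2.0, 3.0, 4.0];
    let b = [5.0f32, 6.0, 7.0, 8.0];
    let mut c = [0.0f32; 4];
    gemm_blocked(2, 2, 2, &a, &b, &mut c);
    println!("{:?}", c); // prints [19.0, 22.0, 43.0, 50.0]
}
```

The point of the split is that only `microkernel` needs a target-specific version; packing and the outer loops stay portable.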

Development Goals

  • Code clarity and maintainability
  • Portability and stable Rust
  • Performance: provide target-specific microkernels when it is beneficial
  • Testing: Test diverse inputs and test and benchmark all microkernels
  • Small code footprint and fast compilation
  • We are not reimplementing BLAS.

Benchmarks

  • cargo bench is useful for special cases and small matrices
  • The best gemm and threading benchmark is examples/benchmark.rs, which supports custom sizes, some configuration, and csv output. Use the script benches/benchloop.py to run benchmarks over parameter ranges.

Recent Changes

  • 0.3.7

    • Rename a directory, avoiding spaces in filenames, to be compatible with Bazel. By @xander-zitara
  • 0.3.6

    • Fix the build for the combination of cgemm and no_std (#76)
  • 0.3.5

    • Significant improvements to complex matrix packing and kernels (#75)
    • Use a specialized AVX2 matrix packing function for sgemm, dgemm when this feature is detected on x86-64
  • 0.3.4

    • Sgemm, dgemm microkernel implementations for AArch64 NEON (ARM)

      Matrixmultiply now uses autocfg to detect the Rust version, so that these kernels are enabled when AArch64 intrinsics are available (from Rust 1.61).

    • Small change to matrix packing functions so that they in some cases optimize better due to improvements to pointer alias information.

  • 0.3.3

    • Attempt to fix macOS bug #55 again (it manifests as a debug assertion, only in debug builds).
    • Updated comments for x86 kernels by @Tastaturtaste
    • Updates to MIRI/CI by @jturner314
    • Silenced Send/Sync future compatibility warnings for a raw pointer wrapper
  • 0.3.2

    • Add optional feature cgemm for complex matmult functions cgemm and zgemm
    • Add optional feature constconf for compile-time configuration of matrix kernel parameters for chunking. Improved scripts for benchmarking over ranges of different settings. With thanks to @DutchGhost for the const-time parsing functions.
    • Improved benchmarking and testing.
    • Threading is now slightly more eager to use multiple threads (depending on matrix element count).
  • 0.3.1

    • Attempt to fix bug #55 where the mask buffer in TLS did not seem to get its requested alignment on macOS. The mask buffer pointer is now aligned manually (again, like it was in 0.2.x).
    • Fix a minor issue where we were passing a buffer pointer as &T when it should have been &[T].
  • 0.3.0

    • Implement initial support for threading using a bespoke thread pool with little contention. To use, enable feature threading (and configure number of threads with the variable MATMUL_NUM_THREADS).

      Initial support is for up to 4 threads - will be updated with more experience in coming versions.

    • Added a better benchmarking program for arbitrary size and layout, see examples/benchmark.rs for this; it supports csv output for better recording of measurements

    • Minimum supported Rust version is 1.41.1 and the version update policy has been updated.

    • Updated to Rust 2018 edition

    • Moved CI to GitHub Actions (so long Travis and thanks for all the fish).

  • 0.2.4

    • Support no-std mode, by @vadixidav and @jturner314. New (default) feature flag "std"; use default-features = false to disable it and use no-std. Note that runtime CPU feature detection requires std.
    • Fix tests so that they build correctly on non-x86 platforms (#49), and manage the release, by @bluss
  • 0.2.3

    • Update rawpointer dependency to 0.2
    • Minor changes to inlining for -Ctarget-cpu=native use (not recommended; use automatic runtime feature detection).
    • Minor improvements to kernel masking (#42, #41) by @bluss and @SuperFluffy
  • 0.2.2

    • New dgemm avx and fma kernels implemented by R. Janis Goldschmidt (@SuperFluffy). With fast cases for both row and column major output.

      Benchmark improvements: Using fma instructions reduces execution time on dgemm benchmarks by 25-35% compared with the avx kernel, see issue #35

      Using the avx dgemm kernel reduces execution time on dgemm benchmarks by 5-7% compared with the previous version's autovectorized kernel.

    • New fma adaptation of the sgemm avx kernel by R. Janis Goldschmidt (@SuperFluffy).

      Benchmark improvement: Using fma instructions reduces execution time on sgemm benchmarks by 10-15% compared with the avx kernel, see issue #35

    • More flexible kernel selection allows kernels to individually set all their parameters, ensures the fallback (plain Rust) kernels can be tuned for performance as well, and moves feature detection out of the gemm loop.

      Benchmark improvement: Reduces execution time on various benchmarks by 1-2% in the avx kernels, see #37.

    • Improved testing to cover input/output strides of more diversity.

  • 0.2.1

    • Improve matrix packing by taking better advantage of contiguous inputs.

      Benchmark improvement: execution time for a 64×64 problem where inputs are either both row major or both column major changed by -5% for sgemm and -1% for dgemm. (#26)

    • In the sgemm avx kernel, handle column major output arrays just like it does row major arrays.

      Benchmark improvement: execution time for 32×32 problem where output is column major changed by -11%. (#27)

  • 0.2.0

    • Use runtime feature detection on x86 and x86-64 platforms, to enable AVX-specific microkernels at runtime if available on the currently executing configuration.

      This means no special compiler flags are needed to enable native instruction performance!

    • Implement a specialized 8×8 sgemm (f32) AVX microkernel; this speeds up matrix multiplication by another 25%.

    • Use std::alloc for allocation of aligned packing buffers

    • We now require Rust 1.28 as the minimal version

  • 0.1.15

    • Fix bug where the result matrix C was not updated in the case of an M × K by K × N matrix multiplication where K was zero. (This resulted in the output C potentially being left uninitialized or with incorrect values in this specific scenario.) By @jturner314 (PR #21)
  • 0.1.14

    • Avoid an unused code warning
  • 0.1.13

    • Pick 8x8 sgemm (f32) kernel when AVX target feature is enabled (with Rust 1.14 or later, no effect otherwise).
    • Use rawpointer, a µcrate with raw pointer methods taken from this project.
  • 0.1.12

    • Internal cleanup with retained performance
  • 0.1.11

    • Adjust sgemm (f32) kernel to optimize better on recent Rust.
  • 0.1.10

    • Update doc links to docs.rs
  • 0.1.9

    • Workaround optimization regression in rust nightly (1.12-ish) (#9)
  • 0.1.8

    • Improved docs
  • 0.1.7

    • Reduce overhead slightly for small matrix multiplication problems by using only one allocation call for both packing buffers.
  • 0.1.6

    • Disable manual loop unrolling in debug mode (quicker debug builds)
  • 0.1.5

    • Update sgemm to use a 4x8 microkernel (“still in simplistic rust”), which improves throughput by 10%.
  • 0.1.4

    • Prepare support for aligned packed buffers
    • Update dgemm to use an 8x4 microkernel, still in simplistic rust, which improves throughput by 10-20% when using AVX.
  • 0.1.3

    • Silence some debug prints
  • 0.1.2

    • Major performance improvement for sgemm and dgemm (20-30% when using AVX). Since it all depends on what the optimizer does, I'd love to get issue reports that report good or bad performance.
    • Made the kernel masking generic, which is a cleaner design
  • 0.1.1

    • Minor improvement in the kernel
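
The runtime feature detection described in the 0.2.0 entry above can be illustrated with a small sketch. This is not the crate's kernel selection code: the functions `dot`, `dot_avx`, and `dot_fallback` are hypothetical, and a dot product stands in for a real microkernel. The only standard-library facilities used are `is_x86_feature_detected!` and the `#[target_feature]` attribute.

```rust
/// Portable fallback: a plain dot product that the compiler may autovectorize.
fn dot_fallback(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Same computation, compiled with AVX enabled for this function only.
/// It must only be called after AVX support has been confirmed.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    dot_fallback(a, b)
}

/// Dispatcher: picks the AVX variant at runtime when the CPU supports it,
/// otherwise uses the portable fallback. No special compiler flags are needed.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was just confirmed at runtime.
            return unsafe { dot_avx(a, b) };
        }
    }
    dot_fallback(a, b)
}

fn main() {
    let a = [1.0f32, 2.0, 3.0];
    let b = [4.0f32, 5.0, 6.0];
    println!("{}", dot(&a, &b));
}
```

On targets other than x86 and x86-64 the detection block is compiled out and the fallback is always used, which mirrors the note in 0.2.4 that runtime CPU feature detection requires std.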

More Repositories

  1. indexmap (Rust, 1,361 stars): A hash table with consistent order and fast iteration; access items by key or sequence index
  2. arrayvec (Rust, 714 stars): A vector with a fixed capacity. (Rust)
  3. scopeguard (Rust, 429 stars): Rust crate for a convenient RAII scope guard.
  4. maplit (Rust, 378 stars): Rust container / collection literal macros for HashMap, HashSet, BTreeMap, BTreeSet.
  5. either (Rust, 377 stars): The enum Either with variants Left and Right is a general purpose sum type with two cases.
  6. indexing (Rust, 100 stars): Sound unchecked indexing using “generativity”; a type system approach to indices, pointers and ranges that are trusted to be in bounds.
  7. bencher (Rust, 83 stars): bencher is just a port of the libtest (unstable) benchmark runner to Rust stable releases. `cargo bench` on stable. "Not a better bencher!" = No feature development. Go build a better stable benchmarking library.
  8. twoway (Rust, 65 stars): Twoway / Fast substring search for strings and byte strings (Rust) / Also assorted benchmarks and string search snippets
  9. permutohedron (Rust, 36 stars)
  10. rustfaq (30 stars): Bluss' Rust FAQ, intended to answer questions that are often encountered, and being simple to update. There's quite a lot of editing needed to get to a useful FAQ, so we'll see if we get there.
  11. odds (Rust, 21 stars): Odds and ends — collection miscellanea. Extra functionality for slices, strings and other things
  12. fast-floats (Rust, 19 stars): Experimental (unstable) “fast-math” wrappers for f32, f64
  13. pyproject-local-kernel (Python, 15 stars): Python Jupyter kernel using pyproject environment managers like Rye, Uv, PDM, Poetry, Hatch etc.
  14. defmac (Rust, 15 stars): A macro to define lambda-like macros inline.
  15. ixlist (Rust, 11 stars): The “ixlist” is a linked list in a vector
  16. itertools-num (Rust, 10 stars): [looking for adoption of the crate] Numerics iterator tools; cumulative sum and linearly spaced floats and such.
  17. debugit (Rust, 10 stars): DebugIt: Use specialization so you can debug-print a value without explicit (and viral) Debug trait bound. (Not really maintained. Uses specialization so it may break)
  18. thread-tree (Rust, 10 stars): A tree-shaped thread pool used for spawning stack-bound scoped jobs with no work stealing. Derived from a little bit of rayon-core code + new thread pool on top.
  19. tox-rye (Python, 7 stars): Use Rye to discover or fetch Python versions for Tox (v4)
  20. unchecked-index (Rust, 7 stars): Unchecked indexing and debug-assertion checked indexing through the regular index syntax.
  21. qc.rs (Rust, 6 stars): quickcheck for Rust
  22. simplesvg (Rust, 6 stars): Very simple drawing/diagramming library with svg output (Rust).
  23. virtualenv-rye-discovery (Python, 5 stars): Virtualenv plugin for discovery of Python interpreters through Rye.
  24. ryeish-kernel (Python, 5 stars): Python Jupyter kernel using Rye for reproducible notebooks
  25. knuth-morris-pratt (Rust, 5 stars): Generic substring search for sequences. Requires elements that are equality comparable.
  26. galil-seiferas (Rust, 5 stars): Galil-Seiferas algorithm: String search in constant space, linear time, for nonorderable alphabets
  27. rawpointer (Rust, 4 stars): Extra methods for raw pointers
  28. dlx (Rust, 4 stars): Dancing Links (“dlx”) solver for the exact cover problem, written in Rust. Can be used to create a sudoku solver.
  29. rpasha (Rust, 4 stars): Restrictive Precise Angle Shadowcasting, a field of view algorithm for a discrete playing field.
  30. rawslice (Rust, 2 stars)
  31. siphash-ppc-altivec (C, 2 stars)
  32. asprim (Rust, 2 stars): Rust library with a trait for the `as` operator for primitive numeric types.
  33. blake2-ppc-altivec (C, 2 stars): BLAKE2s hash function https://blake2.net PowerPC/Altivec implementation by Ulrik Sverdrup
  34. trait-group (Rust, 2 stars)
  35. blis-sys (Rust, 2 stars): [Feel free to adopt/fork this repo!] Experimental Rust bindings for BLIS
  36. bmap (Rust, 1 star): Experimental B-tree map in Rust using ArrayVec
  37. rand_derive (Rust, 1 star): (Obsolete) derive(Rand) for use with custom_derive!{}. Works with stable Rust 1.8+
  38. numeric-loops (Rust, 1 star)
  39. complexfloat (Rust, 1 star): Extended traits for Rust Float and Complex types. Requires Rust 1.8.
  40. bluss.github.io (HTML, 1 star)