Blocking Tutorial
This is an internal Hazy Research group tutorial illustrating SIMD and cache blocking.
Based on the paper: Anatomy of High-Performance Many-Threaded Matrix Multiplication http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf
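To give a flavor of where the tutorial ends up, here is a minimal sketch of the register-blocked AVX2/FMA microkernel idea behind a name like sgemm_16x6_block_parallel: a 16x6 tile of C is held entirely in YMM registers and updated with FMAs. This is an illustration of the technique under assumed packed layouts for A and B, not the exact code in this repo.

```cpp
#include <immintrin.h>

// Illustrative 16x6 single-precision microkernel (not the repo's exact code).
// Assumes A has been packed so each k-step yields 16 contiguous floats
// (one 16-row sliver), B packed so each k-step yields 6 contiguous floats,
// and C is column-major with leading dimension ldc.
void micro_kernel_16x6(int K, const float* A, const float* B,
                       float* C, int ldc) {
    __m256 c[6][2];
    for (int j = 0; j < 6; ++j) {                 // load the 16x6 tile of C
        c[j][0] = _mm256_loadu_ps(C + j * ldc);
        c[j][1] = _mm256_loadu_ps(C + j * ldc + 8);
    }
    for (int k = 0; k < K; ++k) {                 // one rank-1 update per k
        __m256 a0 = _mm256_loadu_ps(A + 16 * k);
        __m256 a1 = _mm256_loadu_ps(A + 16 * k + 8);
        for (int j = 0; j < 6; ++j) {
            __m256 b = _mm256_broadcast_ss(B + 6 * k + j);
            c[j][0] = _mm256_fmadd_ps(a0, b, c[j][0]);  // fused multiply-add
            c[j][1] = _mm256_fmadd_ps(a1, b, c[j][1]);
        }
    }
    for (int j = 0; j < 6; ++j) {                 // write the tile back
        _mm256_storeu_ps(C + j * ldc,     c[j][0]);
        _mm256_storeu_ps(C + j * ldc + 8, c[j][1]);
    }
}
```

The 12 accumulators plus two A vectors and one B broadcast occupy 15 of Haswell's 16 YMM registers, which is why tiles of roughly this shape show up in the paper.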
Requirements
To run the code, you will need:
- A Haswell processor or later (AVX2 and FMA support) http://en.wikipedia.org/wiki/Advanced_Vector_Extensions http://en.wikipedia.org/wiki/FMA_instruction_set
- The OpenMP library
- I used g++ 4.9.2, but any compiler supporting AVX2 and FMA should work (for g++, that means building with -mavx2 -mfma, plus -fopenmp for OpenMP)
- OpenBLAS, used as the performance baseline; disable it if you are not interested in the comparison http://www.openblas.net/
Instructions
To compile:
- Change compile.sh to point to OpenBLAS or some other BLAS. To run without OpenBLAS, modify the .cpp file to not call OpenBLAS (remove the references to cblas.h and cblas_sgemm, or guard them as in the sketch below).
- Compile with: bash compile.sh
- Run with: ./matmul
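For reference, one way to make the OpenBLAS dependency optional is to guard it behind a preprocessor flag rather than deleting it. This is a sketch assuming a hypothetical USE_OPENBLAS flag; the repo's .cpp calls cblas_sgemm directly.

```cpp
// Sketch: guard the OpenBLAS reference behind a (hypothetical) USE_OPENBLAS
// flag instead of deleting it. Build with -DUSE_OPENBLAS -lopenblas to keep
// the comparison, or without it to drop the dependency.
#ifdef USE_OPENBLAS
#include <cblas.h>
#endif

void sgemm_reference(int N, const float* A, const float* B, float* C) {
#ifdef USE_OPENBLAS
    // C = 1.0*A*B + 0.0*C for row-major NxN matrices
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0f, A, N, B, N, 0.0f, C, N);
#else
    // Plain triple loop as a slow but dependency-free reference
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)
                acc += A[i * N + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
#endif
}
```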
This will run two of the matrix multiply implementations and compare:
- The outputs (to ensure they match)
- The total time and GFLOPS
To change which version is compared against, edit which function is called (sgemm_naive, sgemm_16x6_block_parallel, etc.).
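A minimal sketch of what that comparison amounts to (the names and structure here are illustrative, not the repo's actual main()):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <omp.h>

using SgemmFn = void (*)(int N, const float* A, const float* B, float* C);

// Time `candidate`, check its output against `reference`, and report
// wall-clock time and achieved GFLOPS. Illustrative harness only.
void compare(int N, SgemmFn candidate, SgemmFn reference) {
    std::vector<float> A(N * N), B(N * N), C1(N * N, 0.0f), C2(N * N, 0.0f);
    for (int i = 0; i < N * N; ++i) {     // arbitrary non-uniform test data
        A[i] = static_cast<float>(i % 13) * 0.25f;
        B[i] = static_cast<float>(i % 7)  * 0.50f;
    }

    double t0 = omp_get_wtime();
    candidate(N, A.data(), B.data(), C1.data());
    double seconds = omp_get_wtime() - t0;

    reference(N, A.data(), B.data(), C2.data());

    // The outputs should agree up to floating-point reassociation error.
    float max_err = 0.0f;
    for (int i = 0; i < N * N; ++i)
        max_err = std::max(max_err, std::fabs(C1[i] - C2[i]));

    // A dense NxN sgemm performs 2*N^3 floating-point operations.
    double gflops = 2.0 * N * N * N / seconds / 1e9;
    std::printf("%.3f milliseconds  GFLOPS=%f  max error=%g\n",
                seconds * 1e3, gflops, max_err);
}
```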
Experiments
For NxN single-precision matrix multiplication (N=1920, judging from the GFLOPS figures) on a Haswell with 2 cores / 4 threads and the following caches:
- L1 I$/D$: 2 x 32 kB, 8-way
- L2$: 2 x 256 kB, 8-way
- L3$: 1 x 4 MB, 16-way
Naive       8330.310 milliseconds   GFLOPS=1.699310
+SIMD       1057.134 milliseconds   GFLOPS=13.390711
+Blocking    273.5390 milliseconds  GFLOPS=51.750485
+Threads     138.5209 milliseconds  GFLOPS=102.192277
OpenBLAS     117.7092 milliseconds  GFLOPS=120.259758
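Each jump in the table has a concrete source: +SIMD is the microkernel sketched above, +Blocking tiles the loops so blocks of A and B are reused from cache, and +Threads splits the tiled loops across cores with OpenMP. Here is a minimal sketch of blocking plus threading on their own (block sizes are illustrative, and the SIMD and packing that the real code combines with this are omitted):

```cpp
#include <algorithm>

// Illustrative cache blocking plus OpenMP threading, without SIMD or
// packing. Assumes row-major NxN matrices and a zero-initialized C.
// Block sizes are illustrative, not the tuned values in the repo; the
// idea is that a KC x NC strip of B and an MC x KC block of A stay
// resident in cache while they are reused.
const int MC = 64, NC = 64, KC = 256;

void sgemm_blocked(int N, const float* A, const float* B, float* C) {
    for (int jc = 0; jc < N; jc += NC)
        for (int kc = 0; kc < N; kc += KC)
            // Threads get disjoint row blocks of C, so no write conflicts.
            #pragma omp parallel for
            for (int ic = 0; ic < N; ic += MC)
                for (int i = ic; i < std::min(ic + MC, N); ++i)
                    for (int k = kc; k < std::min(kc + KC, N); ++k) {
                        float a = A[i * N + k];
                        for (int j = jc; j < std::min(jc + NC, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```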
For various N:

N=960 (~1 million elements per matrix):
  This code  101.2 GFLOPS
  OpenBLAS    98.3 GFLOPS

N=1920 (~4 million elements per matrix):
  This code  104.1 GFLOPS
  OpenBLAS   121.2 GFLOPS

N=2880 (~8 million elements per matrix):
  This code   91.7 GFLOPS
  OpenBLAS   103.9 GFLOPS