A curated list of awesome high performance computing resources.
General Info
A Few Upcoming Supercomputers
- El Capitan - 2023, AMD-based, ~1.5 exaflops
- Aurora - 2022, Intel-based, ~2 exaflops
- Tianhe-3 - 2022, ~700 petaflops (Linpack500)
Most Recent List of the Top500 Supercomputers
History
- History of Supercomputing (Wikipedia)
- History of the Top500 (Wikipedia)
- History of LLNL Computing
- The Supermen: The Story of Seymour Cray and the Technical Wizards Behind the Supercomputer (1997)
- Unmatched - 50 Years of Supercomputing (Coming out Sept 22, 2023)
Trends
Software
Popular HPC Programming Libraries/APIs/Tools/Standards
- OpenMP: Multi-platform Shared-memory Parallel Programming in C/C++ and Fortran
- HPX: A C++ Standard Library for Concurrency and Parallelism
- Kokkos: A C++ Programming Model for Writing Performance Portable Applications on HPC platforms
- Charm++: Parallel Programming with Migratable Objects
- CUDA: High performance NVIDIA GPU acceleration
- MPI: Message Passing Interface; Open MPI implementation
- MPI: Message Passing Interface; MPICH implementation
- MPI Standardization Forum
- Taskflow: A Modern C++ Parallel Task Programming Library
- CAF: An Open Source Implementation of the Actor Model in C++
- Chapel: A Programming Language for Productive Parallel Computing on Large-scale Systems
- Cilk Plus: C/C++ Extension for Data and Task Parallelism
- OpenCilk: MIT continuation of Cilk Plus
- FastFlow: High-performance Parallel Patterns in C++
- Galois: A C++ Library to Ease Parallel Programming with Irregular Parallelism
- Heteroflow: Concurrent CPU-GPU Task Programming using Modern C++
- Intel TBB: Threading Building Blocks
- RaftLib: A C++ Library for Enabling Stream and Dataflow Parallel Computation
- STAPL: Standard Template Adaptive Parallel Programming Library in C++
- STLab: High-level Constructs for Implementing Multicore Algorithms with Minimized Contention
- Transwarp: A Header-only C++ Library for Task Concurrency
- PVM: Parallel Virtual Machine, a predecessor to MPI for distributed computing
- OpenACC: "OpenMP for GPUs"
- numba: Numba is an open source JIT compiler that translates a subset of Python into fast machine code.
- dask: Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
- ray: scale AI and Python workloads — from reinforcement learning to deep learning
- RAJA: architecture and programming model portability for HPC applications
- ROCm: first open-source software development platform for HPC/Hyperscale-class GPU computing
- HIP: a C++ Runtime API and Kernel Language for AMD/Nvidia GPUs
- MOGSLib - User defined schedulers
- SYCL - C++ Abstraction layer for heterogeneous devices
- Legion - Distributed heterogeneous programming library
- SkelCL – A Skeleton Library for Heterogeneous Systems
- Legate - Nvidia replacement for numpy based on Legion
- The Open Community Runtime - Specification for Asynchronous Many Task systems
- Pyfi - distributed flow and computation system
- HPC-X - Nvidia implementation of MPI
- MVAPICH - Implementation of MPI
- mpi4py - python bindings for MPI
- UCX - optimized, production-proven communication framework
- Horovod - distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet
- Taichi - parallel programming language for high-performance numerical computations (embedded in Python with JIT support)
- MAGMA - next generation linear algebra (LA) GPU accelerated libraries
- NVIDIA cuNumeric - GPU drop-in for numpy
- Halide - a language for fast, portable computation on images and tensors
- Microsoft MPI
- PMIx
- NCCL - The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking
- Kompute - The general purpose GPU compute framework for cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends)
- alpaka - The alpaka library is a header-only C++17 abstraction library for accelerator development
- Kubeflow MPI Operator
- highway - performance portable SIMD intrinsics
- NVIDIA stdpar - GPU accelerated C++
- Tuplex - Blazing fast python data science
- Implicit SPMD Program Compiler (ISPC) - An open-source compiler for high-performance SIMD programming on the CPU and GPU
- mpi4jax - zero-copy mpi for jax arrays
- RS MPI - rust bindings for MPI
- async-rdma - A framework for writing RDMA applications with high-level abstraction and asynchronous APIs
- joblib - data-flow programming for performance (python)
- oneAPI - open, cross-industry, standards-based, unified, multiarchitecture, multi-vendor programming model
- Codon - high-performance Python compiler that compiles Python code to native machine code without any runtime overhead
- DeepSpeed - an easy-to-use deep learning optimization software suite that enables unprecedented scale and speed for Deep Learning Training and Inference
- Intel ISPC - SPMD compiler
- Alpa - training large scale neural networks
- Scalix - data parallel computing framework
- HIPLZ - framework for intel-gpu architectures
- DeterminedAI - Distributed deep learning
Cluster Hardware Discovery Tools
- LIKWID - command-line tools for inspecting hardware topology and measuring performance counters on cluster nodes
- LIKWID.jl - julia wrapper for likwid
- cpuid
- cpuid instruction note
- cpufetch
- gpufetch
- intel cpuinfo
- openmpi hwloc
- PRK - Parallel Research Kernels
Cluster Management/Tools/Schedulers/Stacks
- Flux framework
- Bright Cluster Manager
- E4S - The Extreme Scale HPC Scientific Stack
- RADIUSS - Rapid Application Development via an Institutional Universal Software Stack
- OpenHPC
- Slurm
- SGE
- Portable Batch System & OpenPBS
- Lustre Parallel File System
- GPFS
- Spack package manager for HPC/supercomputers
- Guix package manager for HPC/supercomputers
- Easybuild package manager for HPC/supercomputers
- Lmod
- Ruse
- xCAT
- Warewulf
- Bluebanquise
- OpenXdMod
- LSF
- BeeGFS
- DeepOps - Nvidia GPU infrastructure and automation tools
- fpsync - fast parallel data transfer using fpart and rsync
- moosefs - distributed file system
- rocks - open-source Linux cluster distribution
- sstack - a tool to install multiple software stacks, such as Spack, EasyBuild, and Conda
- DeepOps - Infrastructure automation tools for Kubernetes and Slurm clusters with NVIDIA GPUs
- OpenOnDemand - Access your organization’s supercomputers through the web to compute from anywhere, on any device.
- XDMoD - open source tool to facilitate the management of high performance computing resources
- Globus Connect - Fast transfer of data/files between supercomputers
HPC-specific Operating Systems
Development/Workflow/Monitoring Tools for HPC
- Apptainer (formerly Singularity) - "the docker of HPC"
- Docker
- Kubernetes
- slurm docker cluster
- Vaex - high performance dataframes in python
- HTCondor
- grpc - high performance modern remote procedure call framework
- Charliecloud
- Jacamar-ci
- Prefect
- Apache Airflow
- HPC Rocket - submit slurm jobs in CI
- Stui - Slurm dashboard for the terminal
- Slurmvision slurm dashboard
- genv - GPU Environment Management
- snakemake - a framework for reproducible data analysis
- ruptime - batch job monitoring
- remora - batch job monitoring
- perun - energy monitor
Debugging Tools for HPC
- Summary of C/C++ debugging tools
- ddt
- totalview
- marmot MPI checker
- python debugging tools
- Differential Flamegraphs
- Hotspot - linux perf GUI
Performance/Benchmark Tools for HPC
- Summary of code performance analysis tools
- papi
- scalasca
- tau
- scalene
- vampir
- kerncraft
- NASA parallel benchmark suite
- The Bandwidth Benchmark
- Google benchmark
- demonspawn
- HPL benchmark
- stress-ng
- IOR
- bytehound memory profiler
- Flamegraphs
- fio
- IBM Spectrum Scale Key Performance Indicators (KPI)
- Hotspot - the Linux perf GUI for performance analysis
- mixbench - benchmarks for CPUs and GPUs
- pmu-tools (toplev) performance tools for modern Intel CPUs
- SPEC CPU Benchmark
- STREAM Memory Bandwidth Benchmark
- Intel MPI benchmarks
- Ohio state MPI benchmarks
- hpctoolkit - performance analysis toolkit
IO/Visualization Tools for HPC
General Purpose Scientific Computing Libraries for HPC
- petsc
- ginkgo
- GSL
- Scalapack
- rapids.ai - collection of libraries for executing end-to-end data science pipelines completely in the GPU
- trilinos
- tnl project
Misc.
- mimalloc memory allocator
- jemalloc memory allocator
- tcmalloc memory allocator
- Horde memory allocator
- Software utilization at UK National Supercomputing Service, ARCHER2
Wikis
Hardware
Interconnects/Topology
- Ethernet
- Infiniband
- Network topologies
- Battle of the infinibands - Omnipath vs Infiniband
- Mellanox infiniband cluster config
- RoCE - RDMA Over Converged Ethernet
- Slingshot interconnect
- CXL - Compute Express Link
CPU
- Wikichip
- Microarchitecture of Intel/AMD CPUs
- Apple M1
- Apple M2
- Apple M2 Teardown
- Apple M1/M2 AMX
- List of Intel processors
- List of Intel micro architectures
- Comparison of Intel processors
- Comparison of Apple processors
- List of AMD processors
- List of AMD CPU micro architectures
- Comparison of AMD architectures
GPU
- Gpu Architecture Analysis
- A trip through the Graphics Pipeline
- A100 Whitepaper
- MIG
- Gentle Intro to GPU Inner Workings
- AMD Instinct GPUs
- List of AMD GPUs
- Comparison of CUDA architectures
- Tales of the M1 GPU
- List of Intel GPUs
- Performance of DGX Cluster
TPU/Tensor Cores
Many integrated core processor (MIC)
Cloud
Vendors
- AWS HPC
- Azure HPC
- rescale
- vast.ai
- vultr - cheap bare metal CPU, GPU, DGX servers
- hetzner - cheap servers incl. 80-core ARM
- Ampere ARM cloud-native processors
- Scaleway
- Chameleon Cloud
Articles/Papers
- The use of Microsoft Azure for high performance cloud computing – A case study
- AWS Cluster in the cloud
- AWS Parallel Cluster
- An Empirical Study of Containerized MPI and GUI Application on HPC in the Cloud
Custom/FPGA/ASIC/APU
Certification
Student Opportunities
- Supercomputing Conference Student Opportunities
- SCC Student cluster competition
- Winter Classic Invitational
- Linux Cluster Institute
Other/Wikis
- Supercomputer
- Supercomputer architecture
- Computer cluster
- Comparison of Intel processors
- Comparison of Apple processors
- Comparison of AMD architectures
- Comparison of CUDA architectures
- Cache
- Google TPU
- IPMI
- FRU
- Disk Arrays
- RAID
- Cray
People
- Jack Dongarra - 2021 Turing Award - LINPACK, BLAS, LAPACK, MPI
- Bill Gropp - 2010 IEEE TCSC Medal for Excellence in Scalable Computing
- David Bader - built the first Linux supercomputer
- Thomas Sterling - Inventor of Beowulf cluster, ParalleX/HPX
- Seymour Cray - Inventor of the Cray Supercomputer
- Larry Smarr - HPC Application Pioneer
Resources
Books/Manuals
- HPC Books by Victor Eijkhout
- Parallel and High Performance Computing
- Algorithms for Modern Hardware
- High Performance Computing: Modern Systems and Practices - Thomas Sterling, Maciej Brodowicz, Matthew Anderson 2017
- Introduction to High Performance Computing for Scientists and Engineers - Hager 2010
- Computer Organization and Design
- Optimizing HPC Applications with Intel Cluster Tools: Hunting Petaflops
- Introduction to High Performance Scientific Computing - Victor Eijkhout 2021
- Parallel Programming for Science and Engineering - Victor Eijkhout 2021
- Parallel Programming for Science and Engineering - HTML Version
- C++ High Performance
- Data Parallel C++ Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL
- High Performance Python
- C++ Concurrency in Action: Practical Multithreading - Anthony Williams 2012
- The Art of Multiprocessor Programming - Maurice Herlihy 2012
- Parallel Computing: Theory and Practice - Umut A. Acar 2016
- Introduction to Parallel Computing - Zbigniew J. Czech
- Practical guide to bare metal C++
- Optimizing software in C++
- Optimizing subroutines in assembly code
- Microarchitecture of Intel/AMD CPUs
- Parallel Programming with MPI
- HPC, Big Data, AI Convergence Towards Exascale: Challenge and Vision
- Introduction to parallel computing - Ananth Grama
- The Student Supercomputer Challenge Guide
- The Rust Performance Book
- E-Zines on Bash, Linux, Perf, etc - Julia Evans
- The Art of Writing Efficient Programs: An Advanced Programmer's Guide to Efficient Hardware Utilization and Compiler Optimizations Using C++ Examples
Courses
- Berkeley: Applications of Parallel Computers - Detailed course on HPC
- CS6290 High-performance Computer Architecture - Milos Prvulovic and Catherine Gamboa at Georgia Tech
- Udacity High Performance Computing
- Parallel Numerical Algorithms
- Vanderbilt - Intro to HPC
- Illinois - Intro to HPC - Creator of PyCuda
- Archer1 Courses
- TACC tutorials
- Livermore training materials
- Xsede training materials
- Parallel Computation Math
- Introduction to High-Performance and Parallel Computing - Coursera
- Foundations of HPC 2020/2021
- Principles of Distributed Computing
- High Performance Visualization
- Temple course on building/maintaining a cluster
- Nvidia Deep Learning Course
- Coursera GPU Programming Specialization
- Coursera Fundamentals of Parallelism on Intel Architecture
- Coursera Introduction to High Performance Computing
- Archer2 Shared Memory Programming with OpenMP
- Archer2 Message-Passing Programming with MPI
- HetSys 2022 Course
- Edukamu Introduction to Supercomputing
- Heterogeneous Parallel Programming by S K
- NCSA HPC Training Moodle
- Supercomputing in plain english
- Cornell workshop
- Carpentries Incubator HPC Intro
- UL HPC School
- Introduction to High-Performance Parallel Distributed Computing using Chapel, UPC++ and Coarray Fortran
Tutorials/Guides/Articles
- MpiTutorial - A fantastic mpi tutorial
- Beginners Guide to HPC
- Rookie HPC Guide
- RedHat High Performance Computing 101
- Parallel Computing Training Tutorials - Lawrence Livermore National Laboratory
- Foundations of Multithreaded, Parallel, and Distributed Programming
- Building pipelines using slurm dependencies
- Writing Slurm scripts in Python, R, and Bash
- Xsede new user tutorials
- Supercomputing in plain english
- Improving Performance with SIMD intrinsics
- Want speed? Pass by value
- Introduction to low level bit hacks
- How to write fast numerical code: An Introduction
- Lecture notes on Loop optimizations
- A practical approach to code optimization
- Software optimization manuals
- Guide into OpenMP: Easy multithreading programming for C++
- An Introduction to the Partitioned Global Address Space (PGAS) Programming Model
- Jax in 2022
- C++ Benchmarking for beginners
- Mapping MPI ranks to multiple CUDA GPUs
- Oak Ridge National Lab Tutorials
- How to perform large scale data processing in bioinformatics
- Step by step SGEMM in OpenCL
- Frontier User Guide
- Allocating large blocks of memory in bare-metal C programming
- Hashmap benchmarks 2022
- LLNL HPC Tutorials
- High Performance Computing: A Bird's Eye View
- The dirty secret of high performance computing
- Multiple GPUs with pytorch
- Brendan Gregg on Linux Performance
- Automatic Slurm build scripts
- Fastest unordered_map implementation / benchmarks
- Memory bandwidth NapkinMath
- Avoiding Instruction Cache Misses
- Multi-GPU Programming with Standard Parallel C++
- EuroCC National Competence Center Sweden (ENCCS) HPC tutorials
- python.org Python Performance Tips
- HPC toolset tutorial (cluster management)
Presentations
- Undersubscription: An Underutilized Factor in High-Performance Computing
- Practical Debugging and Performance Engineering
Review Papers/Articles
- The Landscape of Exascale Research: A Data-Driven Literature Analysis (2020)
- The Landscape of Parallel Computing Research: A View from Berkeley
- Extreme Heterogeneity 2018: Productive Computational Science in the Era of Extreme Heterogeneity
- Programming for Exascale Computers - Will Gropp, Marc Snir
- On the Memory Underutilization: Exploring Disaggregated Memory on HPC Systems (2020)
- Advances in Parallel & Distributed Processing, and Applications (conference proceedings)
- Designing Heterogeneous Systems: Large Scale Architectural Exploration Via Simulation
- Reinventing High Performance Computing: Challenges and Opportunities (2022)
- Challenges in Heterogeneous HPC White Paper (2022)
- An Evolutionary Technical & Conceptual Review on High Performance Computing Systems (Dec 2021)
- New Horizons for High-Performance Computing (2022)
- Confidential High-Performance Computing in the Public Cloud
- Containerisation for High Performance Computing Systems: Survey and Prospects
- Heterogeneous Computing Systems (2023)
- Myths and Legends in High-Performance Computing
- Energy-Aware Scheduling for High-Performance Computing Systems: A Survey
- Ultimate Physical limits to computation - Seth Lloyd
- Abstract Machine Models and Proxy Architectures for Exascale Computing, 2014, Sandia National Laboratories and Lawrence Berkeley National Laboratory
- Some thoughts on the environmental impact of High Performance Computing
- A Research Retrospective on AMD's Exascale Computing Journey
News
Podcasts
Youtube Videos/Courses/Channels
- Argonne supercomputer tour
- Containers in HPC - what they fix and what they break
- HPC Tech Shorts
- CppCon
- Create a clustering server
- Argonne national lab
- Oak Ridge National Lab
- Concurrency in C++20 and Beyond - A. Williams
- Is Parallel Programming still Hard? - P. McKenney, M. Michael, and M. Wong at CppCon 2017
- The Speed of Concurrency: Is Lock-free Faster? - Fedor G Pikus in CppCon 2016
- Expressing Parallelism in C++ with Threading Building Blocks - Mike Voss at Intel Webinar 2018
- A Work-stealing Runtime for Rust - Aaron Todd in Air Mozilla 2017
- C++11/14/17 atomics and memory model: Before the story consumes you - Michael Wong in CppCon 2015
- The C++ Memory Model - Valentin Ziegler at C++ Meeting 2014
- Sharcnet HPC
- Low Latency C++ for fun and profit
- Scalene Python profiler
- Kokkos lectures
- EasyBuild Tech Talk I - The ABCs of Open MPI, part 1 (by Jeff Squyres & Ralph Castain)
- The Spack 2022 Roadmap
- A Not So Simple Matter of Software | Talk by Turing Award Winner Prof. Jack Dongarra
- Vectorization/SIMD intrinsics
- New Silicon for Supercomputers: A Guide for Software Engineers
- TechTechPotato Channel
Presentation Slides
- Task based Parallelism and why it's awesome - Pedro Gonnet
- Tuning Slurm Scheduling for Optimal Responsiveness and Utilization
- Parallel Programming Models Overview (2020)
- Comparative Analysis of Kokkos and Sycl (Jeff Hammond)
- Hybrid OpenMP/MPI Programming
- Designs, Lessons and Advice from Building Large Distributed Systems - Jeff Dean (Google)
Forums
Careers
- HPC University Careers search
- HPC wire career site
- HPC certification
- HPC SysAdmin Jobs (reddit)
- The United States Research Software Engineer Association
- NCSA Internship
- AI and Future HPC Job Prospect
Membership Clubs
Blogs
- 1024 Cores - Dmitry Vyukov
- The Black Art of Concurrency - Internal Pointers
- Cluster Monkey
- Jonathan Dursi
- Arm Vendor HPC blog
- HPC Notes
- Brendan Gregg Performance Blog
Journals
- IEEE Transactions on Parallel and Distributed Systems (TPDS)
- Journal of Parallel and Distributed Computing
Conferences
- ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP)
- ACM Symposium on Parallel Algorithms and Architectures (SPAA)
- SC conference (SC)
- IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- International Conference on Parallel Processing (ICPP)
- IEEE High Performance Extreme Computing Conference (HPEC)
Communities/Chat Groups
Twitters
Consulting
Interview Preparation
Organizations
- Prace
- Xsede
- Compute Canada
- Riken CCS
- Pawsey
- International Data Corporation
- List of Federally funded research and development centers
Misc. Wikis
- Amdahl's Law
- HPC Wiki
- FLOPS
- Computational complexity of math operations
- Many Task Computing
- High Throughput Computing
- Parallel Virtual Machine
- OSI Model
- Workflow management
- Compute Canada Documentation
- Network Interface Controller (NIC)
- Just in time compilation
- List of distributed computing projects
- Computer cluster
- Quasi-opportunistic supercomputing
- Limits of Computation
- Bremermann's Limit
Building Clusters
- Build a cluster under 50k
- Build a Beowulf cluster
- Build a Raspberry Pi Cluster
- Puget Systems
- Lambda Systems
- Titan computers
- Temple course on building/maintaining a cluster
- Detailed reddit discussion on setting up a small cluster
- Tiny titan - build a really cool pi supercomputer
- Building an Intel HPC cluster with OpenHPC
- Reddit r/HPC post on building clusters
- Build a virtual cluster with PelicanHPC
Misc. Papers/Articles
- Advanced Parallel Programming in C++
- Tools for scientific computing
- Quantum Computing for High Performance Computing
- Benchmarking data science: Twelve ways to lie with statistics and performance on parallel computers.
- Establishing the IO500 Benchmark
- NVIDIA High Performance Computing articles
- Let's write a superoptimizer
- How the titan supercomputer was recycled
Misc. Repos
- Build a Beowulf cluster
- libsc - Supercomputing library
- xbyak jit assembler
- cpufetch - pretty cpu info fetcher
- RRZE-HPC
- Argonne Github
- Argonne Leadership Computing Facility
- Oak Ridge National Lab Github
- Compute Canada
- HPCInfo by Jeff Hammond
- Texas Advanced Computing Center (TACC) Github
- LANL HPC Github
- Rust in HPC
- University at Buffalo - Center for Computational Research
Misc.
- Exascale Project
- Pocket HPC Survival Guide
- HPC Summer school
- Overview of all linear algebra packages
- Latency numbers
- Nvidia HPC benchmarks
- Intel Intrinsics Guide
- AWS Cloud calculator
- Quickly benchmark C++ functions
- LLNL Software repository
- Boinc - volunteer computing projects
- Prace Training Events
- Nice discussion on FlameGraph profiling
- Nice discussion on parts of a supercomputer on reddit
- Technical Report on C++ performance
- BOINC Compute for science
- Count prime numbers using MPI
Misc. Theses
Other Curated Lists
Acknowledgements
This repo started from the great curated list https://github.com/taskflow/awesome-parallel-computing