Awesome-GPU
Architecture
Resource Management
- TECS'21-Reducing Energy in GPGPUs through Approximate Trivial Bypassing
- ASPLOS'17-Locality-Aware CTA Clustering for Modern GPUs
- ASPLOS'17-Dynamic Resource Management for Efficient Utilization of Multitasking GPUs
- HPCA'17-Dynamic GPGPU Power Management Using Adaptive Model Predictive Control
- ISCA'16-Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems
Parallelism
- HPCA'18-Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls
- HPCA'17-Controlled Kernel Launch for Dynamic Parallelism in GPUs
- GTC'17-COOPERATIVE GROUPS
- ISCA'16-LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs
- ISCA'16-Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit
- Berkeley TechRpts'16-Understanding Latency Hiding on GPUs
Cache
- ISCA'16-APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs
- SC'15-Adaptive and Transparent Cache Bypassing for GPUs
Memory
- ICCAD'21-Improving Inter-kernel Data Reuse With CTA-Page Coordination in GPGPU
- SC'21-In-Depth Analyses of Unified Virtual Memory System for GPU Accelerated Computing
- IBM'20-Umpire: Application-Focused Management and Coordination of Complex Hierarchical Memory
- HPCA'13-Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization
White Papers
- NVIDIA Hopper-NVIDIA H100 Tensor Core GPU Architecture
- NVIDIA Ampere-NVIDIA A100 Tensor Core GPU Architecture
- NVIDIA Turing-NVIDIA TURING GPU ARCHITECTURE
- NVIDIA Volta-NVIDIA TESLA V100
- NVIDIA Pascal-NVIDIA TESLA P100
- NVIDIA Kepler-NVIDIA’s Next Generation CUDA Compute Architecture: Kepler
- NVIDIA Fermi-NVIDIA’s Next Generation CUDA Compute Architecture: Fermi
- AMD CDNA 2-INTRODUCING AMD CDNA 2 ARCHITECTURE
- AMD CDNA-INTRODUCING AMD CDNA ARCHITECTURE
Algorithms
BLAS
- GTC'20-DEVELOPING CUDA KERNELS TO PUSH TENSOR CORES TO THE ABSOLUTE LIMIT ON NVIDIA A100
- IPDPS'20-Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply
- PPoPP'19-A Coordinated Tiling and Batching Framework for Efficient GEMM on GPU
- GTC'18-CUTLASS: CUDA TEMPLATE LIBRARY FOR DENSE LINEAR ALGEBRA AT ALL LEVELS AND SCALES
Stencils
- CGO'20-AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs
- IPDPS'20-On Optimizing Complex Stencils on GPUs
- PPoPP'18-Register Optimizations for Stencils on GPUs
Scans
- NVResearch TechRpts'16-Single-pass Parallel Prefix Scan with Decoupled Look-back
Applications
Deep Learning
- PPoPP'21-Understanding and bridging the gaps in current GNN performance optimizations
- SC'21-E.T.: re-thinking self-attention for transformer models on GPUs
- OSDI'21-GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs
- SC'20-Sparse GPU Kernels for Deep Learning
- PPoPP'18-SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks
- HPCA'17-Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures
Tools
Benchmarking
- GTC'18-Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
- ISPASS'10-Demystifying GPU Microarchitecture through Microbenchmarking
Models
- PMBS'19-Instruction Roofline: An insightful visual performance model for GPUs
- ECP'19-Performance Tuning of Scientific Codes with the Roofline Model
- GTC'18-VOLTA Architecture and performance optimization
- Synthesis Lectures on Computer Architecture'12-Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
- SC'10-Fundamental Optimizations
Simulators
- ISPASS'10-Visualizing Complex Dynamics in Many-Core Accelerator Architectures
- ISPASS'09-Analyzing CUDA Workloads Using a Detailed GPU Simulator
Profilers
- PLDI'18-GPU Code Optimization using Abstract Kernel Emulation and Sensitivity Analysis
- CGO'18-CUDAAdvisor: LLVM-based runtime profiling for modern GPUs
- CCGRID'18-Exposing Hidden Performance Opportunities in High Performance GPU Applications
- THPC'16-Monitoring Heterogeneous Applications with the OpenMP Tools Interface
- Euro-Par'15-Identifying Optimization Opportunities Within Kernel Execution in GPU Codes
- SC'13-Effective sampling-driven performance tools for GPU-accelerated supercomputers
- ISPASS'12-Lynx: A dynamic instrumentation system for data-parallel applications on GPGPU architectures
- ICPP'11-Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs
- Vampir|Score-P
- TAU
- PAPI
- Allinea MAP
- Open|SpeedShop
- HPCToolkit
- NVIDIA Nsight Systems
- NVIDIA Nsight Compute
- SASSI
- NVBit
Runtime
Scheduling
- PPoPP'22-CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems
- TPDS'20-cCUDA: Effective Co-Scheduling of Concurrent Kernels on GPUs
Code Generation
Compilers
- AMD'21-Generating GPU Compiler Heuristics using Reinforcement Learning
- TACO'21-Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation
- LLVM'17-Implementing implicit OpenMP data sharing on GPUs
- CGO'16-gpucc: An Open-Source GPGPU Compiler
- LLVM'16-Offloading Support for OpenMP in Clang and LLVM
- PMBS'15-Performance Analysis of OpenMP on a GPU using a CORAL Proxy Application
- LLVM'15-Integrating GPU Support for OpenMP Offloading Directives into Clang
- LLVM'14-Coordinating GPU Threads for OpenMP 4.0 in LLVM
Programming Models
- CGO'21-C-for-metal: high performance SIMD programming on Intel GPUs
- ECRTS'19-Novel Methodologies for Predictable CPU-To-GPU Command Offloading
- ASPLOS'14-Paraprox: Pattern-Based Approximation for Data Parallel Applications
Profile Guided Optimization
- Geometry and Optimization'21-Cooperative Profile Guided Optimizations
- IPDPS'13-Kernel Specialization for Improved Adaptability and Performance on Graphics Processing Units (GPUs)