ImplicitGlobalGrid.jl

Almost trivial distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid

ImplicitGlobalGrid is an outcome of a collaboration of the Swiss National Supercomputing Centre, ETH Zurich (Dr. Samuel Omlin) with Stanford University (Dr. Ludovic Räss) and the Swiss Geocomputing Centre (Prof. Yuri Podladchikov). It renders the distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid almost trivial and enables close to ideal weak scaling of real-world applications on thousands of GPUs [1, 2, 3]:

(Figure: weak scaling on Piz Daint)

ImplicitGlobalGrid relies on the Julia MPI wrapper (MPI.jl) to perform halo updates close to the hardware limit and leverages CUDA-aware or ROCm-aware MPI for GPU applications. The communication can straightforwardly be hidden behind computation [1, 3] (how this can be done automatically when using ParallelStencil.jl is shown in [3]; a general approach particularly suited for CUDA C applications is explained in [4]).

A particularity of ImplicitGlobalGrid is the automatic implicit creation of the global computational grid based on the number of processes the application is run with (and based on the process topology, which can be explicitly chosen by the user or automatically defined). As a consequence, the user only needs to write code that solves the problem on one GPU/CPU (local grid); then, as little as three functions can be enough to transform a single-GPU/CPU application into a massively scaling Multi-GPU/CPU application. See the example below. 1-D, 2-D and 3-D grids are supported. Here is a sketch of the global grid that results from running a 2-D solver with 4 processes (P1-P4) (a 2x2 process topology is created by default in this case):

(Figure: implicit global grid resulting from a 2-D solver run with 4 processes, 2x2 process topology)

Multi-GPU with three functions

Only three functions are required to perform halo updates close to the hardware limit:

  • init_global_grid
  • update_halo!
  • finalize_global_grid

Three additional functions are provided to query Cartesian coordinates with respect to the global computational grid if required:

  • x_g
  • y_g
  • z_g

Moreover, the following three functions allow querying the size of the global grid:

  • nx_g
  • ny_g
  • nz_g
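
For orientation, the following minimal sketch (not taken from the repository; the array A and the stencil kernel compute_step! are placeholders) shows how these functions typically fit together:

using ImplicitGlobalGrid

function sketch()
    nx, ny, nz = 64, 64, 64;                   # Local grid size per process
    me, dims   = init_global_grid(nx, ny, nz); # Implicitly defines the global grid from the number of processes
    A          = zeros(nx, ny, nz);            # Local array (a CuArray in a GPU application)
    dx         = 1.0/(nx_g()-1);               # Global grid spacing, via the global size nx_g()
    for it = 1:100
        # compute_step!(A, dx);                # Placeholder: update the local grid
        update_halo!(A);                       # Exchange the halos with the neighboring processes
    end
    finalize_global_grid();
end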

The following Multi-GPU 3-D heat diffusion solver illustrates how these functions enable the creation of massively parallel applications.

50-line Multi-GPU example

This simple Multi-GPU 3-D heat diffusion solver uses ImplicitGlobalGrid. It relies fully on the broadcasting capabilities of CUDA.jl's CuArray type to perform the stencil computations with maximal simplicity (CUDA.jl also enables writing explicit GPU kernels, which can lead to significantly better performance for these computations).

using ImplicitGlobalGrid, CUDA

@views d_xa(A) = A[2:end  , :     , :     ] .- A[1:end-1, :     , :     ];
@views d_xi(A) = A[2:end  ,2:end-1,2:end-1] .- A[1:end-1,2:end-1,2:end-1];
@views d_ya(A) = A[ :     ,2:end  , :     ] .- A[ :     ,1:end-1, :     ];
@views d_yi(A) = A[2:end-1,2:end  ,2:end-1] .- A[2:end-1,1:end-1,2:end-1];
@views d_za(A) = A[ :     , :     ,2:end  ] .- A[ :     , :     ,1:end-1];
@views d_zi(A) = A[2:end-1,2:end-1,2:end  ] .- A[2:end-1,2:end-1,1:end-1];
@views  inn(A) = A[2:end-1,2:end-1,2:end-1]

@views function diffusion3D()
    # Physics
    lam        = 1.0;                                       # Thermal conductivity
    cp_min     = 1.0;                                       # Minimal heat capacity
    lx, ly, lz = 10.0, 10.0, 10.0;                          # Length of domain in dimensions x, y and z

    # Numerics
    nx, ny, nz = 256, 256, 256;                             # Number of grid points in dimensions x, y and z
    nt         = 100000;                                    # Number of time steps
    init_global_grid(nx, ny, nz);                           # Initialize the implicit global grid
    dx         = lx/(nx_g()-1);                             # Space step in dimension x
    dy         = ly/(ny_g()-1);                             # ...        in dimension y
    dz         = lz/(nz_g()-1);                             # ...        in dimension z

    # Array initializations
    T     = CUDA.zeros(Float64, nx,   ny,   nz  );
    Cp    = CUDA.zeros(Float64, nx,   ny,   nz  );
    dTedt = CUDA.zeros(Float64, nx-2, ny-2, nz-2);
    qx    = CUDA.zeros(Float64, nx-1, ny-2, nz-2);
    qy    = CUDA.zeros(Float64, nx-2, ny-1, nz-2);
    qz    = CUDA.zeros(Float64, nx-2, ny-2, nz-1);

    # Initial conditions (heat capacity and temperature with two Gaussian anomalies each)
    Cp .= cp_min .+ CuArray([5*exp(-((x_g(ix,dx,Cp)-lx/1.5))^2-((y_g(iy,dy,Cp)-ly/2))^2-((z_g(iz,dz,Cp)-lz/1.5))^2) +
                             5*exp(-((x_g(ix,dx,Cp)-lx/3.0))^2-((y_g(iy,dy,Cp)-ly/2))^2-((z_g(iz,dz,Cp)-lz/1.5))^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)])
    T  .= CuArray([100*exp(-((x_g(ix,dx,T)-lx/2)/2)^2-((y_g(iy,dy,T)-ly/2)/2)^2-((z_g(iz,dz,T)-lz/3.0)/2)^2) +
                    50*exp(-((x_g(ix,dx,T)-lx/2)/2)^2-((y_g(iy,dy,T)-ly/2)/2)^2-((z_g(iz,dz,T)-lz/1.5)/2)^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)])

    # Time loop
    dt = min(dx*dx,dy*dy,dz*dz)*cp_min/lam/8.1;                                               # Time step for the 3D Heat diffusion
    for it = 1:nt
        qx    .= -lam.*d_xi(T)./dx;                                                           # Fourier's law of heat conduction: q_x   = -λ δT/δx
        qy    .= -lam.*d_yi(T)./dy;                                                           # ...                               q_y   = -λ δT/δy
        qz    .= -lam.*d_zi(T)./dz;                                                           # ...                               q_z   = -λ δT/δz
        dTedt .= 1.0./inn(Cp).*(-d_xa(qx)./dx .- d_ya(qy)./dy .- d_za(qz)./dz);               # Conservation of energy:           δT/δt = 1/cₚ (-δq_x/δx - δq_y/δy - δq_z/δz)
        T[2:end-1,2:end-1,2:end-1] .= inn(T) .+ dt.*dTedt;                                    # Update of temperature:            T_new = T_old + dt*δT/δt
        update_halo!(T);                                                                      # Update the halo of T
    end

    finalize_global_grid();                                                                   # Finalize the implicit global grid
end

diffusion3D()

The corresponding file can be found here. A basic CPU-only example (without multi-threading) is available here.
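
How the GPU example maps to the CPU-only variant is, in essence, an allocation change; here is a sketch (an assumption, not the repository's file) of the modified lines, with unchanged lines represented by #(...):

using ImplicitGlobalGrid                                    # No CUDA needed on the CPU
#(...)
    T     = zeros(Float64, nx,   ny,   nz  );               # Plain Arrays replace CUDA.zeros(...)
    Cp    = zeros(Float64, nx,   ny,   nz  );
    dTedt = zeros(Float64, nx-2, ny-2, nz-2);
    qx    = zeros(Float64, nx-1, ny-2, nz-2);
    qy    = zeros(Float64, nx-2, ny-1, nz-2);
    qz    = zeros(Float64, nx-2, ny-2, nz-1);
#(...)                                                      # The CuArray(...) wrappers in the initial conditions are dropped; the time loop is identical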

Straightforward in-situ visualization / monitoring

ImplicitGlobalGrid provides a function to gather an array from each process into one large array on a single process, assembled according to the global grid:

  • gather!

This enables straightforward in-situ visualization or monitoring of Multi-GPU/CPU applications using, e.g., the Julia Plots package, as shown in the following (the GR backend is used as it is particularly fast according to the Julia Plots documentation). It is enough to add a couple of lines to the previous example (omitted unmodified lines are represented with #(...)):

using ImplicitGlobalGrid, CUDA, Plots
#(...)

@views function diffusion3D()
    # Physics
    #(...)

    # Numerics
    #(...)
    me, dims   = init_global_grid(nx, ny, nz);              # Initialize the implicit global grid
    #(...)

    # Array initializations
    #(...)

    # Initial conditions (heat capacity and temperature with two Gaussian anomalies each)
    #(...)

    # Preparation of visualisation
    gr()
    ENV["GKSwstype"]="nul"
    anim = Animation();
    nx_v = (nx-2)*dims[1];
    ny_v = (ny-2)*dims[2];
    nz_v = (nz-2)*dims[3];
    T_v  = zeros(nx_v, ny_v, nz_v);
    T_nohalo = zeros(nx-2, ny-2, nz-2);

    # Time loop
    #(...)
    for it = 1:nt
        if mod(it, 1000) == 1                                                                 # Visualize only every 1000th time step
            T_nohalo .= Array(T[2:end-1,2:end-1,2:end-1]);                                    # Copy data to CPU removing the halo
            gather!(T_nohalo, T_v)                                                            # Gather data on process 0 (could be interpolated/sampled first)
        if (me==0) heatmap(transpose(T_v[:,ny_v÷2,:]), aspect_ratio=1); frame(anim); end  # Visualize it on process 0
        end
        #(...)
    end

    # Postprocessing
    if (me==0) gif(anim, "diffusion3D.gif", fps = 15) end                                     # Create a gif movie on process 0
    if (me==0) mp4(anim, "diffusion3D.mp4", fps = 15) end                                     # Create a mp4 movie on process 0
    finalize_global_grid();                                                                   # Finalize the implicit global grid
end

diffusion3D()

Here is the resulting movie when running the application on 8 GPUs, solving 3-D heat diffusion with heterogeneous heat capacity (two Gaussian anomalies) on a global computational grid of 510x510x510 grid points. It shows the x-z plane at the middle of the y-dimension:

(Movie: 3-D heat diffusion solved on 8 GPUs, x-z plane at the middle of the y-dimension)

The simulation producing this movie - including the in-situ visualization - took 29 minutes on 8 NVIDIA® Tesla® P100 GPUs on Piz Daint (an optimized solution using CUDA.jl's native kernel programming capabilities can be more than 10 times faster). The complete example can be found here. A corresponding basic CPU-only example (without multi-threading) is available here; a movie of a simulation with 254x254x254 grid points, produced with it within 34 minutes on 8 Intel® Xeon® E5-2690 v3 processors (8 processes, no multi-threading), is found here.

Seamless interoperability with MPI.jl

ImplicitGlobalGrid is seamlessly interoperable with MPI.jl. The Cartesian MPI communicator it uses is created by default when calling init_global_grid and can then be obtained as follows (variable comm_cart):

me, dims, nprocs, coords, comm_cart = init_global_grid(nx, ny, nz);

Moreover, the automatic initialization and finalization of MPI can be deactivated in order to replace them with direct calls to MPI.jl:

init_global_grid(nx, ny, nz; init_MPI=false);
finalize_global_grid(;finalize_MPI=false)

In addition, init_global_grid makes every argument it passes to an MPI.jl function customizable via its keyword arguments.
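
As an illustration of this interoperability (a sketch, not one of the package's examples), the communicator returned by init_global_grid can be passed directly to MPI.jl functions, e.g. to reduce a locally computed value to a global one:

using ImplicitGlobalGrid, MPI

nx, ny, nz = 64, 64, 64;
me, dims, nprocs, coords, comm_cart = init_global_grid(nx, ny, nz);
T_local_max  = rand();                                          # Placeholder for a locally computed quantity
T_global_max = MPI.Allreduce(T_local_max, MPI.MAX, comm_cart);  # Plain MPI.jl reduction over the Cartesian communicator
if (me==0) println("Global maximum: ", T_global_max) end
finalize_global_grid();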

CUDA-aware/ROCm-aware MPI support

If the system supports CUDA-aware or ROCm-aware MPI, it may be activated for ImplicitGlobalGrid by setting an environment variable as specified in the module documentation, callable from the Julia REPL or in IJulia (see the next section).

Module documentation callable from the Julia REPL / IJulia

The module documentation can be called from the Julia REPL or in IJulia:

julia> using ImplicitGlobalGrid
julia>?
help?> ImplicitGlobalGrid
search: ImplicitGlobalGrid

  Module ImplicitGlobalGrid

  Renders the distributed parallelization of stencil-based GPU and CPU applications on a
  regular staggered grid almost trivial and enables close to ideal weak scaling of
  real-world applications on thousands of GPUs.

  General overview and examples
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  https://github.com/eth-cscs/ImplicitGlobalGrid.jl

  Functions
  ≡≡≡≡≡≡≡≡≡

    •    init_global_grid

    •    finalize_global_grid

    •    update_halo!

    •    gather!

    •    select_device

    •    nx_g

    •    ny_g

    •    nz_g

    •    x_g

    •    y_g

    •    z_g

    •    tic

    •    toc

  To see a description of a function, type ?<functionname>.

  │ Performance note
  │
  │  If the system supports CUDA-aware MPI (for Nvidia GPUs) or
  │  ROCm-aware MPI (for AMD GPUs), it may be activated for
  │  ImplicitGlobalGrid by setting one of the following environment
  │  variables (at latest before the call to init_global_grid):
  │
  │  shell> export IGG_CUDAAWARE_MPI=1
  │
  │  shell> export IGG_ROCMAWARE_MPI=1

julia>

Dependencies

ImplicitGlobalGrid relies on the Julia MPI wrapper (MPI.jl), the Julia CUDA package (CUDA.jl [5, 6]) and the Julia AMDGPU package (AMDGPU.jl).

Installation

ImplicitGlobalGrid may be installed directly with the Julia package manager from the REPL:

julia>]
  pkg> add ImplicitGlobalGrid
  pkg> test ImplicitGlobalGrid

References

[1] Räss, L., Omlin, S., & Podladchikov, Y. Y. (2019). Porting a Massively Parallel Multi-GPU Application to Julia: a 3-D Nonlinear Multi-Physics Flow Solver. JuliaCon Conference, Baltimore, USA.

[2] Räss, L., Omlin, S., & Podladchikov, Y. Y. (2019). A Nonlinear Multi-Physics 3-D Solver: From CUDA C + MPI to Julia. PASC19 Conference, Zurich, Switzerland.

[3] Omlin, S., Räss, L., Kwasniewski, G., Malvoisin, B., & Podladchikov, Y. Y. (2020). Solving Nonlinear Multi-Physics on GPU Supercomputers with Julia. JuliaCon Conference, virtual.

[4] Räss, L., Omlin, S., & Podladchikov, Y. Y. (2019). Resolving Spontaneous Nonlinear Multi-Physics Flow Localisation in 3-D: Tackling Hardware Limit. GPU Technology Conference 2019, San Jose, Silicon Valley, CA, USA.

[5] Besard, T., Foket, C., & De Sutter, B. (2018). Effective Extensible Programming: Unleashing Julia on GPUs. IEEE Transactions on Parallel and Distributed Systems, 30(4), 827-841. doi: 10.1109/TPDS.2018.2872064

[6] Besard, T., Churavy, V., Edelman, A., & De Sutter, B. (2019). Rapid software prototyping for heterogeneous and distributed platforms. Advances in Engineering Software, 132, 29-46. doi: 10.1016/j.advengsoft.2019.02.002
