Hybrid Fortran - A Framework for GPU Acceleration

What & Why

Quickstart

1. Clone this repo and point the HF_DIR environment variable to its path.
2. Type cd $HF_DIR && make example. This creates an example project directory below $HF_DIR. Note: You can move this directory wherever you want and use it as a template for your own projects.
3. Have a look at example/source/example.h90, which walks you through how to use Hybrid Fortran. (Psst: It's better to open it in your editor of choice with Fortran syntax highlighting, but if you want to skip steps 1-2 you may also click the link. ;-) )

A Few More Words

Hybrid Fortran is ..

.. a directive based extension for the Fortran language.
.. a way for you to keep writing your Fortran code like you're used to - only now with GPGPU support.
.. a preprocessor for your code - its inputs are Fortran files (with Hybrid Fortran extensions), its output is CUDA Fortran or OpenMP Fortran code (or whatever else you'd like to have as a backend).
.. a build system that handles building two separate versions (CPU / GPU) of your codebase automatically, including all the preprocessing.
.. a test system that handles verification of your outputs automatically after setup.
.. a framework for you to build your own parallel code implementations (OpenCL, ARM, FPGA, Hamster Wheel.. as long as it has some parallel Fortran support you're good) while keeping the same source files.

Hybrid Fortran has been successfully used for porting the physical core of Japan's national next-generation weather prediction model to GPGPU. We're currently planning to port the internationally used open source weather model WRF to Hybrid Fortran as well.

Hybrid Fortran has been developed since 2012 by Michel Müller, MSc ETH Zurich, as a guest at Prof. Aoki's Gordon Bell award winning laboratory at the Tokyo Institute of Technology, as well as during a temporary stay with Prof. Maruyama at the RIKEN Advanced Institute for Computational Science in Kobe, Japan.

Even More Words

The following blog entry explains why Hybrid Fortran was created and how it can help you:

Accelerators in HPC – Having the Cake and Eating It Too

License

Hybrid Fortran is available under the GNU Lesser General Public License (LGPL).

Sample Results

Name	Performance Results	Speedup HF on 6 Core vs. 1 Core [1]	Speedup HF on GPU vs 6 Core [1]	Speedup HF on GPU vs 1 Core [1]
Japanese Physical Weather Prediction Core (121 Kernels)	Slides Only Slidecast	4.47x	3.63x	16.22x
3D Diffusion	Link	1.06x	10.94x	11.66x
Particle Push	Link	9.08x	21.72x	152.79x
Poisson on FEM Solver with Jacobi Approximation	Link	1.41x	5.13x	7.28x
MIDACO Ant Colony Solver with MINLP Example	Link	5.26x	10.07x	52.99x

Code Example

The following sample code shows a wrapper subroutine and an add subroutine. Please note that storage order inefficiencies are ignored in this example: passing the strided sections a(x,y,:), b(x,y,:) and c(x,y,:) would make the compiler create implicit copies of the z-dimension of arrays a, b and c.

module example
contains
subroutine wrapper(a, b, c)
    real, intent(in), dimension(NX, NY, NZ) :: a, b
    real, intent(out), dimension(NX, NY, NZ) :: c
    integer(4) :: x, y
    do y=1,NY
      do x=1,NX
        call add(a(x,y,:), b(x,y,:), c(x,y,:))
      end do
    end do
end subroutine

subroutine add(a, b, c)
    real, intent(in), dimension(NZ) :: a, b
    real, intent(out), dimension(NZ) :: c
    integer :: z
    do z=1,NZ
        c(z) = a(z) + b(z)
    end do
end subroutine
end module example

Here's what this code would look like in Hybrid Fortran, parallelizing the x and y dimensions on both CPU and GPU.

module example
contains
subroutine wrapper(a, b, c)
  real, dimension(NZ), intent(in) :: a, b
  real, dimension(NZ), intent(out) :: c
  @domainDependant{domName(x,y), domSize(NX,NY), attribute(autoDom)}
  a, b, c
  @end domainDependant
  @parallelRegion{appliesTo(CPU), domName(x,y), domSize(NX, NY)}
  call add(a, b, c)
  call mult(a, b, c)
  @end parallelRegion
end subroutine

subroutine add(a, b, c)
  real, dimension(NZ), intent(in) :: a, b
  real, dimension(NZ), intent(out) :: c
  integer :: z
  @domainDependant{domName(x,y), domSize(NX,NY), attribute(autoDom)}
  a, b, c
  @end domainDependant
  @parallelRegion{appliesTo(GPU), domName(x,y), domSize(NX, NY)}
  do z=1,NZ
      c(z) = a(z) + b(z)
  end do
  @end parallelRegion
end subroutine

subroutine mult(a, b, c)
  !...
end subroutine

end module example

Please note the following:

The x and y dimensions have been abstracted away, even in the wrapper. We don't need to privatize the add subroutine in x and y as we would need to in CUDA or OpenACC. The actual computational code in the add subroutine has been left untouched.
We now have two parallelizations: For the CPU, the program is parallelized at the wrapper level using OpenMP. For the GPU, the program is parallelized using CUDA Fortran at the call graph leaf level. In between the two there can be an arbitrarily deep call graph containing arbitrarily many parallel regions (with some restrictions, see below). The data copying from and to the device as well as the privatization in 3D is all handled by the Hybrid Fortran preprocessor framework.
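
To make the CPU case more concrete, here is a hand-written sketch of roughly what a wrapper-level OpenMP parallelization corresponds to in plain Fortran. It is not actual output of the Hybrid Fortran preprocessor. The module name example_cpu_sketch, the fixed sizes for NX, NY and NZ, the collapse(2) clause and the small demo program are all assumptions made purely for illustration.

```fortran
! Illustration only: NOT generated by Hybrid Fortran.
module example_cpu_sketch
  implicit none
  ! Hypothetical domain sizes, chosen only for this sketch.
  integer, parameter :: NX = 4, NY = 3, NZ = 2
contains
  subroutine wrapper(a, b, c)
    real, intent(in),  dimension(NX, NY, NZ) :: a, b
    real, intent(out), dimension(NX, NY, NZ) :: c
    integer :: x, y
    ! CPU case: the x/y parallel region sits at the wrapper level.
    ! collapse(2) is an assumption; the loop indices are private by default.
    !$omp parallel do collapse(2)
    do y = 1, NY
      do x = 1, NX
        ! Each call receives one z-column (a strided section, so the
        ! compiler may create temporary copies here).
        call add(a(x, y, :), b(x, y, :), c(x, y, :))
      end do
    end do
    !$omp end parallel do
  end subroutine

  subroutine add(a, b, c)
    ! The computational code only knows about the z-dimension.
    real, intent(in),  dimension(NZ) :: a, b
    real, intent(out), dimension(NZ) :: c
    integer :: z
    do z = 1, NZ
      c(z) = a(z) + b(z)
    end do
  end subroutine
end module example_cpu_sketch

program demo
  use example_cpu_sketch
  implicit none
  real :: a(NX, NY, NZ), b(NX, NY, NZ), c(NX, NY, NZ)
  a = 1.0
  b = 2.0
  call wrapper(a, b, c)
  ! Every element of c should now be 1.0 + 2.0.
  if (all(c == 3.0)) then
    print *, "ok"
  else
    print *, "mismatch"
  end if
end program demo
```

Compiled with OpenMP enabled (for example gfortran -fopenmp), the x and y loops are distributed over threads; without it the directives are treated as comments and the program runs serially with the same result. In Hybrid Fortran the x and y loops and indices do not appear in the source at all, which is the point of the @parallelRegion and @domainDependant directives shown earlier.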

Documentation

Published Materials

"New High Performance GPGPU Code Transformation Framework Applied to Large Production Weather Prediction Code", preprint, accepted for ACM Transactions on Parallel Computing 2018
"Hybrid Fortran: High Productivity GPU Porting Framework Applied to Japanese Weather Prediction Model", in proceedings of WACCPD 2017, Denver CO, USA
Background Story (as mentioned/published in HPC Today, Inside HPC, HPCwire)
Poster SC14
Talk GTC 2014 (Voice + Slides)
Poster GTC 2013
Talk GTC 2013
Master Thesis (2012): ftp://129.132.2.212/pub/students/2012-FS/MA-2012-23_signed.pdf
