Hybrid Fortran - A Framework for GPU Acceleration
What & Why
Quickstart
- Clone this repo and point the HF_DIR environment variable to its path.
- Type cd $HF_DIR && make example. This will create an example project directory below $HF_DIR. Note: You can move this directory wherever you want in order to use it as a template for your projects.
- Have a look at example/source/example.h90, which will guide you through and show you how to use Hybrid Fortran. (Psst: It's better to open it in your editor of choice with Fortran syntax highlighting, but if you want to skip steps 1-2 you may also click the link. ;-) )
A Few More Words
Hybrid Fortran is ..
- .. a directive based extension for the Fortran language.
- .. a way for you to keep writing your Fortran code like you're used to - only now with GPGPU support.
- .. a preprocessor for your code - it takes Fortran files (with Hybrid Fortran extensions) as input and produces CUDA Fortran or OpenMP Fortran code (or whatever else you'd like to have as a backend).
- .. a build system that handles building two separate versions (CPU / GPU) of your codebase automatically, including all the preprocessing.
- .. a test system that handles verification of your outputs automatically after setup.
- .. a framework for you to build your own parallel code implementations (OpenCL, ARM, FPGA, Hamster Wheel.. as long as it has some parallel Fortran support you're good) while keeping the same source files.
Hybrid Fortran has been successfully used for porting the Physical Core of Japan's national next-generation weather prediction model to GPGPU. We're currently planning to port the internationally used open-source weather model WRF to Hybrid Fortran as well.
Hybrid Fortran has been developed since 2012 by Michel Müller, MSc ETH Zurich, as a guest at Prof. Aoki's Gordon Bell award winning laboratory at the Tokyo Institute of Technology, as well as during a temporary stay with Prof. Maruyama at the RIKEN Advanced Institute for Computational Science in Kobe, Japan.
Even More Words
The following blog entry explains why Hybrid Fortran was created and how it can help you:
Accelerators in HPC – Having the Cake and Eating It Too
License
Hybrid Fortran is available under the GNU Lesser General Public License (LGPL).
Sample Results
Name | Performance Results | Speedup HF on 6 Core vs. 1 Core [1] | Speedup HF on GPU vs. 6 Core [1] | Speedup HF on GPU vs. 1 Core [1] |
---|---|---|---|---|
Japanese Physical Weather Prediction Core (121 Kernels) | Slides Only, Slidecast | 4.47x | 3.63x | 16.22x |
3D Diffusion | Link | 1.06x (Compare Performance) | 10.94x (Compare Performance, Compare Speedup) | 11.66x |
Particle Push | Link | 9.08x (Compare Performance) | 21.72x (Compare Performance, Compare Speedup) | 152.79x |
Poisson on FEM Solver with Jacobi Approximation | Link | 1.41x | 5.13x | 7.28x |
MIDACO Ant Colony Solver with MINLP Example | Link | 5.26x | 10.07x | 52.99x |
Code Example
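The following sample code shows a wrapper subroutine and an add subroutine. Please note that storage order inefficiencies are ignored in this example: with z as the last dimension, the slices a(x,y,:), b(x,y,:) and c(x,y,:) are not contiguous in memory, so passing them to add would typically create implicit copies of the z-dimension of arrays a, b, c.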
module example
contains
  subroutine wrapper(a, b, c)
    real, intent(in), dimension(NX, NY, NZ) :: a, b
    real, intent(out), dimension(NX, NY, NZ) :: c
    integer(4) :: x, y
    do y=1,NY
      do x=1,NX
        ! each call passes one z-column of a, b and c
        call add(a(x,y,:), b(x,y,:), c(x,y,:))
      end do
    end do
  end subroutine

  subroutine add(a, b, c)
    real, intent(in), dimension(NZ) :: a, b
    ! c is written to, so it must be intent(out) rather than intent(in)
    real, intent(out), dimension(NZ) :: c
    integer :: z
    do z=1,NZ
      c(z) = a(z) + b(z)
    end do
  end subroutine
end module example
Here's what this code would look like in Hybrid Fortran, parallelizing the x and y dimensions on both CPU and GPU.
module example
contains
  subroutine wrapper(a, b, c)
    real, dimension(NZ), intent(in) :: a, b
    real, dimension(NZ), intent(out) :: c
    @domainDependant{domName(x,y), domSize(NX,NY), attribute(autoDom)}
    a, b, c
    @end domainDependant
    @parallelRegion{appliesTo(CPU), domName(x,y), domSize(NX, NY)}
    call add(a, b, c)
    call mult(a, b, c)
    @end parallelRegion
  end subroutine

  subroutine add(a, b, c)
    real, dimension(NZ), intent(in) :: a, b
    real, dimension(NZ), intent(out) :: c
    integer :: z
    @domainDependant{domName(x,y), domSize(NX,NY), attribute(autoDom)}
    a, b, c
    @end domainDependant
    @parallelRegion{appliesTo(GPU), domName(x,y), domSize(NX, NY)}
    do z=1,NZ
      c(z) = a(z) + b(z)
    end do
    @end parallelRegion
  end subroutine

  subroutine mult(a, b, c)
    !...
  end subroutine
end module example
Please note the following:
- The x and y dimensions have been abstracted away, even in the wrapper. We don't need to privatize the add subroutine in x and y as we would need to in CUDA or OpenACC. The actual computational code in the add subroutine has been left untouched.
- We now have two parallelizations: For the CPU, the program is parallelized at the wrapper level using OpenMP. For the GPU, the program is parallelized using CUDA Fortran at the callgraph leaf level. In between the two can be an arbitrarily deep callgraph, containing arbitrarily many parallel regions (with some restrictions, see below). The data copying from and to the device as well as the privatization in 3D are all handled by the Hybrid Fortran preprocessor framework.
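To make the CPU case more concrete, here is a hand-written sketch of roughly what "parallelized at the wrapper level using OpenMP" amounts to for the add example. It is an illustration only and not the actual output of the Hybrid Fortran preprocessor. The module name example_cpu_sketch, the demo program, the concrete values of NX, NY and NZ, and the choice to store z as the first dimension (so that each column is contiguous) are all assumptions made for this sketch.

```fortran
! Hand-written illustration; NOT actual Hybrid Fortran preprocessor output.
module example_cpu_sketch
  implicit none
  ! Assumed domain sizes (hypothetical values chosen for this sketch).
  integer, parameter :: NX = 4, NY = 3, NZ = 5
contains
  subroutine wrapper(a, b, c)
    ! z is stored first so that each column a(:, x, y) is contiguous
    ! and can be passed to add without a temporary copy.
    real, dimension(NZ, NX, NY), intent(in)  :: a, b
    real, dimension(NZ, NX, NY), intent(out) :: c
    integer :: x, y

    ! Coarse-grained parallelism: the x/y loop nest wraps the whole call.
    !$omp parallel do collapse(2) default(shared)
    do y = 1, NY
      do x = 1, NX
        call add(a(:, x, y), b(:, x, y), c(:, x, y))
      end do
    end do
    !$omp end parallel do
  end subroutine wrapper

  subroutine add(a, b, c)
    ! The computational code only ever sees a single z-column.
    real, dimension(NZ), intent(in)  :: a, b
    real, dimension(NZ), intent(out) :: c
    integer :: z
    do z = 1, NZ
      c(z) = a(z) + b(z)
    end do
  end subroutine add
end module example_cpu_sketch

program demo
  use example_cpu_sketch
  implicit none
  real, dimension(NZ, NX, NY) :: a, b, c

  a = 1.0
  b = 2.0
  call wrapper(a, b, c)

  ! Every element should be 1.0 + 2.0; stop with an error otherwise.
  if (any(abs(c - 3.0) > 1.0e-6)) error stop "unexpected result"
  print *, "all elements of c equal 3.0"
end program demo
```

With GNU Fortran this can be built using gfortran -fopenmp. Without that flag, the !$omp lines are treated as ordinary comments and the same source runs serially. The GPU case is different in kind: there the x and y parallelism moves down into add itself, which is why the Hybrid Fortran version marks the parallel region for the GPU at the leaf rather than in the wrapper.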
Documentation
- Samples Overview
- Results Overview
- Screencast
- Full Documentation For Installation, Getting Started, Usage and Design (PDF)
- Credits
- Contact Information
- Version History
Published Materials
- "New High Performance GPGPU Code Transformation Framework Applied to Large Production Weather Prediction Code", preprint, accepted for ACM Transactions on Parallel Computing 2018
- "Hybrid Fortran: High Productivity GPU Porting Framework Applied to Japanese Weather Prediction Model", in proceedings of WACCPD 2017, Denver CO, USA
- Background Story (as mentioned/published in HPC Today, Inside HPC, HPCwire)
- Poster SC14
- Talk GTC 2014 (Voice + Slides)
- Poster GTC 2013
- Talk GTC 2013
- Master Thesis (2012): ftp://129.132.2.212/pub/students/2012-FS/MA-2012-23_signed.pdf