  • Stars: 135
  • Rank: 269,297 (Top 6%)
  • Language: C++
  • License: Apache License 2.0
  • Created: over 4 years ago
  • Updated: about 2 years ago


Repository Details

HLS-based Graph Processing Framework on FPGAs


ThunderGP: HLS-based Graph Processing Framework on FPGAs

What's new?

ThunderGP for HBM-Enabled FPGA platforms has been released. Please see the branch.

ThunderGP has been published in ACM Transactions on Reconfigurable Technology and Systems (TRETS), as one of "Best Papers in FPGA 2021".

ThunderGP won third place in the 2020 Xilinx Adaptive Computing Developer Contest (top 9 out of 79 teams).

ThunderGP has been accepted to FPGA 2021. Read the paper.

ThunderGP is featured at Xilinx Apps and Libraries.

ThunderGP was presented at XACC@NUS Workshop Series 2020: Reconfigurable Computing Systems. See Slides, Video/YouTube, Video/bilibili.

Introduction

ThunderGP enables data scientists to enjoy the performance of FPGA-based graph processing without compromising programmability. To the best of our knowledge and experiments, it is the fastest graph processing framework on HLS-based FPGAs.

Two aspects enable ThunderGP to deliver superior performance. On the one hand, ThunderGP adopts an improved execution flow that better exploits the pipeline parallelism of the FPGA and reduces the volume of data accessed in global memory. On the other hand, memory accesses are highly optimized to fully utilize the memory bandwidth of the hardware platform.

On Xilinx multi-SLR FPGAs, ThunderGP runs at 250 MHz, and its performance reaches up to 6,400 MTEPS (million traversed edges per second), a 2.9x speedup over the state of the art.

Prerequisites

  • gcc 9.3
  • Tools:
    • SDAccel 2018.3 Design Suite
    • SDAccel 2019.1 Design Suite
  • Evaluated platforms from Xilinx:
    • Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit (SDAccel 2018.3)
    • Alveo U200 Data Center Accelerator Card (SDAccel 2019.1)
    • Alveo U250 Data Center Accelerator Card (SDAccel 2019.1)

Run the code

ThunderGP currently has seven built-in graph algorithms: PageRank (PR), Sparse Matrix-Vector Multiplication (SpMV), Breadth-First Search (BFS), Single Source Shortest Path (SSSP), Closeness Centrality (CC), ArticleRank (AR), and Weakly Connected Component (WCC). The desired application can be built by passing the argument app=[algorithm] to the make command. The table below gives a quick reference.

Argument    Accelerated algorithm
app=pr      PageRank (PR)
app=spmv    Sparse Matrix-Vector Multiplication (SpMV)
app=bfs     Breadth-First Search (BFS)
app=sssp    Single Source Shortest Path (SSSP)
app=cc      Closeness Centrality (CC)
app=ar      ArticleRank (AR)
app=wcc     Weakly Connected Component (WCC)

Here is an example of building the accelerator for PageRank on the Alveo U250 platform with SDAccel 2019.1.

$ git clone https://github.com/Xtra-Computing/ThunderGP.git
$ cd ThunderGP
$ vim ThunderGP.mk
$ # set DEVICES := xilinx_u250_xdma_201830_2 and TARGETS := hw
$ make app=pr all # builds the host executable and the FPGA bitstream; this takes a while :)
# For execution on real hardware, the path to the graph dataset is provided by the user.
$ ./host_graph_fpga_pr xclbin_pr/*.xclbin ./dataset/rmat-14-32.txt

More details: Compiling ThunderGP

Results (performance)

Throughput (MTEPS) of different graph processing algorithms over datasets on the VCU1525 platform.

Dataset   PR      SpMV    BFS     SSSP    CC      AR      WCC
R21       5,015   4,190   5,417   3,901   4,623   4,848   4,584
R24       4,599   3,781   3,437   3,430   4,339   4,486   4,328
G24       5,039   4,037   5,216   3,725   4,752   4,927   4,704
G25       4,464   3,615   4,660   3,343   4,344   4,389   4,356
WT        2,884   2,874   2,717   2,427   2,776   2,833   2,776
MG        4,454   3,883   4,939   3,699   4,077   4,285   4,088
PK        4,001   3,729   4,251   3,169   3,833   3,909   3,716
WP        3,030   2,994   3,112   2,491   2,993   2,931   2,894
LJ        3,186   3,003   3,408   2,623   3,113   3,081   3,099
TW        2,938   2,801   2,120   2,425   2,962   2,853   2,894

Throughput (MTEPS) of different graph processing algorithms over datasets on the U250 platform.

Dataset   PR      SpMV    BFS     SSSP    CC      AR      WCC
R21       4,669   5,056   6,028   4,879   4,783   4,667   4,901
R24       4,732   4,946   5,897   4,285   4,939   4,732   4,988
G24       5,040   5,305   5,772   4,428   3,705   5,040   5,303
G25       4,978   4,072   4,974   3,864   3,661   4,984   5,254
WT        2,251   2,938   2,630   2,583   2,369   2,253   2,405
MG        3,756   4,195   4,949   4,378   3,914   3,737   3,891
PK        3,630   4,372   4,629   3,927   3,865   3,662   3,841
WP        3,255   3,652   4,058   3,417   3,341   3,259   3,432
LJ        3,342   3,693   4,329   3,614   3,557   3,328   3,708
TW        3,538   3,959   3,671   3,585   3,759   3,533   3,806

APIs (programmability)


Benefiting from the high-level abstraction of HLS, our APIs natively support C/C++.
ThunderGP provides three levels of APIs for implementation and further exploration. APIs in L1 and L2 are for building the accelerators, while APIs in L3 are for the host program. Details are as follows:
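
As a rough illustration, the algorithm-specific hooks that a user supplies at the lower levels conceptually look like the following PageRank-style sketch. The names and signatures here are illustrative assumptions, not ThunderGP's exact API; consult the API documentation for the real interfaces.

typedef int prop_t;  // vertex/edge property type (illustrative)

// Scatter hook: compute the value propagated along one edge.
inline prop_t scatterFunc(prop_t srcProp, prop_t edgeWeight) {
    return srcProp * edgeWeight;      // e.g., weighted contribution
}

// Gather hook: accumulate one incoming update into the running value.
inline prop_t gatherFunc(prop_t accumulated, prop_t update) {
    return accumulated + update;      // e.g., sum of contributions
}

// Apply hook: per-vertex update at the end of an iteration.
inline prop_t applyFunc(prop_t gathered, prop_t oldProp) {
    return oldProp + gathered;        // placeholder algorithm-specific step
}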

Framework Overview

The Adopted Computation Model

The Gather-Apply-Scatter (GAS) model is widely used as the computation model of FPGA-based graph processing frameworks because it extends naturally to a variety of graph algorithms. ThunderGP adopts a simplified version of the GAS model, following the work On-the-fly-data-shuffling-for-OpenCL-based-FPGAs. The model updates vertex properties by propagating values from source vertices to destination vertices. Its input is an unordered set of the directed edges of the graph; an undirected edge is represented by a pair of directed edges.

[Figure: pseudocode of the simplified GAS model]

The process per iteration mainly contains three stages: Scatter, Gather, and Apply.

  • In the Scatter stage (lines 2 to 6), an update tuple is generated for each input edge of the format <src, dst, weight>. The tuple has the format <dst, value>, where dst is the destination vertex of the edge and value is computed from the source vertex property and the edge weight. A software sketch of the whole iteration follows this list.
  • In the Gather stage (lines 7 to 9), all update tuples generated in the Scatter stage are accumulated to update the destination vertices.
  • The final Apply stage (lines 10 to 12) executes an apply function on all vertices of the graph.
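
For concreteness, here is a minimal software sketch of one GAS iteration over an unordered edge list, assuming plain C++ with integer properties (illustrative code, not ThunderGP's kernel implementation):

#include <cstdint>
#include <vector>

struct Edge { uint32_t src, dst; int32_t weight; };

// One iteration of the simplified GAS model.
void gasIteration(const std::vector<Edge>& edges, std::vector<int32_t>& prop) {
    std::vector<int32_t> acc(prop.size(), 0);
    // Scatter + Gather: emit an update tuple <dst, value> for each edge
    // and accumulate it at the destination vertex.
    for (const Edge& e : edges) {
        int32_t value = prop[e.src] * e.weight;  // algorithm-specific scatter
        acc[e.dst] += value;                     // gather (accumulation)
    }
    // Apply: algorithm-specific function over all vertices.
    for (std::size_t v = 0; v < prop.size(); ++v) {
        prop[v] = acc[v];                        // placeholder apply function
    }
}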

The Execution Flow of ThunderGP

[Figure: execution flow of ThunderGP]

As shown in the diagram above, the edges of one partition are streamed into the Scatter stage. For each edge, the property of the source vertex is fetched from global memory through the prefetching and caching module; at the same time, the property of the edge (its weight) is streamed in from global memory. These two values go through an algorithm-specific processing step that returns an update to the property of the destination vertex. At the end of the Scatter stage, the update value and the destination vertex of the edge are combined into an update tuple. The update tuples are streamed into the Shuffle stage, which dispatches them to the corresponding Gather processing engines (PEs). The Gather PEs accumulate the update values in local on-chip memory, which caches the properties of the destination vertices. After all edges in the partition have been processed, the cached data in the Gather PEs is written back to global memory, and the Apply stage calls an algorithm-specific function to update all vertices for the next iteration.
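
The per-partition dataflow can be sketched as two streaming stages connected by a FIFO, in HLS-style C++. This is a simplified illustration assuming the Vitis/Vivado HLS hls::stream type; the shuffle stage, the source-vertex cache, and the multiple gather PEs are omitted.

#include "hls_stream.h"

struct EdgeT   { unsigned src, dst; int weight; };
struct UpdateT { unsigned dst; int value; };

// Scatter: fetch the source-vertex property, run the algorithm-specific
// processing, and emit one update tuple per edge.
static void scatterStage(hls::stream<EdgeT>& edges, const int* srcProp,
                         hls::stream<UpdateT>& updates, int numEdges) {
    for (int i = 0; i < numEdges; ++i) {
#pragma HLS PIPELINE II=1
        EdgeT e = edges.read();
        UpdateT u;
        u.dst   = e.dst;
        u.value = srcProp[e.src] * e.weight;  // algorithm-specific scatter
        updates.write(u);
    }
}

// Gather PE: accumulate updates in on-chip memory that caches the
// destination-vertex properties; the buffer is written back to global
// memory after the whole partition has been processed.
static void gatherStage(hls::stream<UpdateT>& updates, int* localProp,
                        int numEdges) {
    for (int i = 0; i < numEdges; ++i) {
#pragma HLS PIPELINE II=1
        UpdateT u = updates.read();
        localProp[u.dst] += u.value;          // gather (accumulation)
    }
}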

Future Work

  • Application wrappers for high-level platforms (Spark, etc.).
  • Hardware-accelerated query engine.
  • Cycle-accurate software simulation for verifying dynamic modules (cache, etc.) and tuning channel depths.
  • Optimization for large-scale graphs (distributed processing or an HBM-based memory hierarchy).

How to cite ThunderGP

If you use ThunderGP in your paper, please cite our work (full version).

@article{10.1145/3517141,
  author    = {Chen, Xinyu and Cheng, Feng and Tan, Hongshi and Chen, Yao and He, Bingsheng and Wong, Weng-Fai and Chen, Deming},
  title     = {ThunderGP: Resource-Efficient Graph Processing Framework on FPGAs with HLS},
  year      = {2022},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  issn      = {1936-7406},
  url       = {https://doi.org/10.1145/3517141},
  doi       = {10.1145/3517141},
  note      = {Just Accepted},
  journal   = {ACM Trans. Reconfigurable Technol. Syst.},
  month     = {feb},
  keywords  = {FPGA; HBM; HLS; Graph Processing; Framework}
}

@inbook{10.1145/3431920.3439290,
  author    = {Chen, Xinyu and Tan, Hongshi and Chen, Yao and He, Bingsheng and Wong, Weng-Fai and Chen, Deming},
  title     = {ThunderGP: HLS-Based Graph Processing Framework on FPGAs},
  year      = {2021},
  url       = {https://doi.org/10.1145/3431920.3439290},
  booktitle = {The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
  pages     = {69–80},
  numpages  = {12}
}

Related publications

Related systems

Key members

Acknowledgement

More Repositories

1. thundersvm - ThunderSVM: A Fast SVM Library on GPUs and CPUs (C++, 1,564 stars)
2. thundergbm - ThunderGBM: Fast GBDTs and Random Forests on GPUs (C++, 692 stars)
3. NIID-Bench - Federated Learning Benchmark - Federated Learning on Non-IID Data Silos: An Experimental Study (ICDE 2022) (Python, 558 stars)
4. FedTree - A tree-based federated learning system (MLSys 2023) (C++, 142 stars)
5. Medusa - Medusa: Building GPU-based Parallel Sparse Graph Applications with Sequential C/C++ Code (Cuda, 61 stars)
6. Awesome-Literature-ILoGs - Awesome literature on imbalanced learning on graphs (58 stars)
7. G3 - G3: A Programmable GNN Training System on GPU (Cuda, 42 stars)
8. briskstream - A Multicore, NUMA Optimised Data Stream Processing System (Java, 39 stars)
9. PyOE - Python library for data stream learning (Python, 28 stars)
10. ThunderRW - Source code of "ThunderRW: An In-Memory Graph Random Walk Engine" published in VLDB'2021 - By Shixuan Sun, Yuhang Chen, Shengliang Lu, Bingsheng He and Yuchen Li (C++, 26 stars)
11. FedSim - A coupled vertical federated learning framework that boosts the model performance with record similarities (NeurIPS 2022) (Python, 23 stars)
12. PrivML (20 stars)
13. SOFF (Python, 19 stars)
14. ConsisGAD (Python, 18 stars)
15. SimFL - Practical Federated Gradient Boosting Decision Trees (AAAI 2020) (C++, 18 stars)
16. ForkGraph (C++, 16 stars)
17. ReGraph - Scaling Graph Processing on HBM-enabled FPGAs with Heterogeneous Pipelines (C++, 16 stars)
18. ThundeRiNG - Fast Multiple Independent Random Number Sequences Generation on FPGAs (C++, 14 stars)
19. hacc_demo (Shell, 14 stars)
20. FedOV - Towards Addressing Label Skews in One-Shot Federated Learning (ICLR 2023) (Python, 14 stars)
21. Vine - Accelerating Exact Constrained Shortest Paths on GPUs (C++, 14 stars)
22. PathEnum - Source code of "PathEnum: Towards Real-Time Hop-Constrained s-t Path Enumeration", published in SIGMOD'2021 - By Shixuan Sun, Yuhang Chen, Bingsheng He, and Bryan Hooi (C++, 14 stars)
23. OEBench - OEBench: Investigating Open Environment Challenges in Real-World Relational Data Streams (VLDB 2024) (Python, 13 stars)
24. VertiBench - Feature partitioner by imbalance or correlation (ICLR 2024) (Jupyter Notebook, 9 stars)
25. omniDB - General query processing engine (C++, 7 stars)
26. LightRW (C++, 6 stars)
27. HashjoinOnHARP - The MAIN project of the paper "Is FPGA useful for Hash Joins?" (C++, 5 stars)
28. PMP (Python, 5 stars)
29. RUSH - A fast library for real-time burst subgraph detection (Python, 4 stars)
30. On-the-fly-data-shuffling-for-OpenCL-based-FPGAs (JavaScript, 4 stars)
31. DeltaBoost - GBDT-based model with efficient unlearning (SIGMOD 2023) (C++, 4 stars)
32. ModelGo (TeX, 4 stars)
33. Pyper (3 stars)
34. KGraph - Concurrent Graph Query Processing with Memoization on Graph (3 stars)
35. Awesome-Prompt-For-Research - Awesome prompts for computer science research including paper editing and code debugging (2 stars)
36. Melia (C, 2 stars)
37. Query_on_OpenCL_FPGA (C++, 1 star)
38. FedGMA - Communication-Efficient Generalized Neuron Matching for Federated Learning (ICPP'23) (Python, 1 star)
39. HashJoin_HMA - A hash join implementation optimized for many-core processors with die-stacked HBMs (C++, 1 star)
40. Clementi - Clementi: A Scalable Multi-FPGA Graph Processing Framework (C++, 1 star)