
Scalable Network Stack supporting TCP/IP, RoCEv2, UDP/IP at 10-100Gbit/s

Table of contents

  1. Getting Started
  2. Compiling HLS modules
  3. Interfaces
    1. TCP/IP
    2. RoCE
  4. Benchmarks
  5. Publications
  6. Citation
  7. Contributors

Getting Started

Prerequisites

  • Xilinx Vivado 2019.1
  • cmake 3.0 or higher

Supported boards (out of the box)

  • Xilinx VC709
  • Xilinx VCU118
  • Alpha Data ADM-PCIE-7V3

Compiling (all) HLS modules and installing them into your IP repository

  1. Optionally specify the location of your IP repository:
export IPREPO_DIR=/home/myname/iprepo
  2. Create a build directory
mkdir build
cd build

3.a) Configure build

cmake .. -DDATA_WIDTH=64 -DCLOCK_PERIOD=3.1 -DFPGA_PART=xcvu9p-flga2104-2L-e -DFPGA_FAMILY=ultraplus -DVIVADO_HLS_ROOT_DIR=/opt/Xilinx/Vivado/2019.1/bin/

3.b) Alternatively, you can use one of the board names to configure your build

cmake .. -DDEVICE_NAME=vcu118

All cmake options:

  • DEVICE_NAME (values: vc709, vcu118, adm7v3): supported devices
  • NETWORK_BANDWIDTH (values: 10, 100): bandwidth of the Ethernet interface in Gbit/s; the default depends on the board
  • FPGA_PART: name of the FPGA part, e.g. xc7vx690tffg1761-2
  • FPGA_FAMILY (values: 7series, ultraplus): name of the FPGA part family
  • DATA_WIDTH (values: 8, 16, 32, 64): data width of the network stack in bytes
  • CLOCK_PERIOD: clock period in nanoseconds, e.g. 3.1 for 100G, 6.4 for 10G
  • TCP_STACK_MSS: maximum segment size of the TCP/IP stack
  • TCP_STACK_WINDOW_SCALING_EN (values: 0, 1): enables the TCP window scaling option
  • VIVADO_HLS_ROOT_DIR: path to the Vivado HLS directory, e.g. /opt/Xilinx/Vivado/2019.1
  4. Build HLS IP cores and install them into the IP repository
make installip

For an example project including the TCP/IP stack or the RoCEv2 stack with DMA to host memory, check out our Distributed Accelerator OS, DavOS.

Working with individual HLS modules

  1. Set up the build directory, e.g. for the TCP module
$ cd hls/toe
$ mkdir build
$ cd build
$ cmake .. -DFPGA_PART=xcvu9p-flga2104-2L-e -DDATA_WIDTH=8 -DCLOCK_PERIOD=3.1
  2. Run c-simulation
$ make csim
  3. Run c-synthesis
$ make synthesis
  4. Generate HLS IP core
$ make ip
  5. Install HLS IP core into the IP repository
$ make installip

Interfaces

All interfaces use the AXI4-Stream protocol. For AXI4-Streams carrying network/data packets, we use the following definition in HLS:

template <int D>
struct net_axis
{
	ap_uint<D>    data;  // data word (D bits)
	ap_uint<D/8>  keep;  // byte qualifier, one bit per byte of data
	ap_uint<1>    last;  // marks the final word of a packet
};
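For illustration, here is a minimal sketch of a module that consumes one packet, i.e. all beats up to and including the one with last set, and forwards it. The function name forward_packet and the includes are ours, not part of the stack:

#include "ap_int.h"
#include "hls_stream.h"

// Forward a single packet (all beats up to and including last == 1)
// from one AXI4-Stream to another, beat by beat.
template <int D>
void forward_packet(hls::stream<net_axis<D> >& in,
                    hls::stream<net_axis<D> >& out)
{
	net_axis<D> word;
	do {
		word = in.read();  // blocking read of one stream beat
		out.write(word);
	} while (!word.last);
}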

TCP/IP

Open Connection

To open a connection, the destination IP address and TCP port have to be provided through the s_axis_open_conn_req interface. The TCP stack answers this request through the m_axis_open_conn_rsp interface, which provides the sessionID and a boolean indicating whether the connection was opened successfully.

Interface definition in HLS:

struct ipTuple
{
	ap_uint<32>	ip_address;
	ap_uint<16>	ip_port;
};
struct openStatus
{
	ap_uint<16>	sessionID;
	bool		success;
};

void toe(...
	hls::stream<ipTuple>& openConnReq,
	hls::stream<openStatus>& openConnRsp,
	...);
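A hypothetical client-side sketch that issues an open request and blocks on the response; the function name and the address/port values are illustrative, not part of the stack API:

// Request a connection to 10.1.1.1:5001 and report the outcome.
void open_connection_example(hls::stream<ipTuple>& openConnReq,
                             hls::stream<openStatus>& openConnRsp,
                             ap_uint<16>& sessionID,
                             bool& connected)
{
	ipTuple server;
	server.ip_address = 0x0A010101; // 10.1.1.1 (example address)
	server.ip_port    = 5001;       // example destination port
	openConnReq.write(server);

	openStatus status = openConnRsp.read(); // blocks until the stack answers
	sessionID = status.sessionID;
	connected = status.success;
}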

Close Connection

To close a connection, the sessionID has to be provided to the s_axis_close_conn_req interface. The TCP/IP stack does not provide a notification upon completion of this request; however, it is guaranteed that the connection is closed eventually.

Interface definition in HLS:

hls::stream<ap_uint<16> >& closeConnReq,
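Usage is a single fire-and-forget write; a hypothetical sketch (the function name is ours):

// Tear down the session identified by sessionID; no response follows.
void close_connection_example(hls::stream<ap_uint<16> >& closeConnReq,
                              ap_uint<16> sessionID)
{
	closeConnReq.write(sessionID); // fire-and-forget
}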

Open a TCP port to listen on

To open a port to listen on (e.g. as a server), the port number has to be provided to s_axis_listen_port_req. The port number has to be in the range of active ports: 0-32767. The TCP stack responds through the m_axis_listen_port_rsp interface, indicating whether the port was set to the listen state successfully.

Interface definition in HLS:

hls::stream<ap_uint<16> >& listenPortReq,
hls::stream<bool>& listenPortRsp,
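A hypothetical server-side sketch that puts port 7 into the listen state and checks the stack's answer (the function name is ours):

// Open a port for listening and report whether the stack accepted it.
void listen_example(hls::stream<ap_uint<16> >& listenPortReq,
                    hls::stream<bool>& listenPortRsp,
                    bool& listening)
{
	listenPortReq.write(7);           // port must be below 32768
	listening = listenPortRsp.read(); // true if the port now listens
}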

Receiving notifications from the TCP stack

The application using the TCP stack can receive notifications through the m_axis_notification interface. A notification indicates either that new data is available or that a connection was closed.

Interface definition in HLS:

struct appNotification
{
	ap_uint<16>			sessionID;
	ap_uint<16>			length;
	ap_uint<32>			ipAddress;
	ap_uint<16>			dstPort;
	bool				closed;
};

hls::stream<appNotification>& notification,
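A sketch of how an application might drain this stream; the branching logic and the function name are ours:

// Consume one notification, if available, and branch on its type.
void notification_example(hls::stream<appNotification>& notification)
{
	if (!notification.empty()) {
		appNotification notif = notification.read();
		if (notif.closed) {
			// the session was closed; release per-session state here
		} else {
			// notif.length new bytes arrived on notif.sessionID from
			// notif.ipAddress/notif.dstPort; fetch them as shown in
			// "Receiving data" below
		}
	}
}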

Receiving data

If data is available on a TCP/IP session, i.e. a notification was received, then this data can be requested through the s_axis_rx_data_req interface. The data as well as the sessionID are then received through the m_axis_rx_data_rsp_metadata and m_axis_rx_data_rsp interfaces.

Interface definition in HLS:

struct appReadRequest
{
	ap_uint<16> sessionID;
	ap_uint<16> length;
};

hls::stream<appReadRequest>& rxDataReq,
hls::stream<ap_uint<16> >& rxDataRspMeta,
hls::stream<net_axis<WIDTH> >& rxDataRsp,
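Putting request and response together, a hypothetical receive path might look as follows (the function name is ours; WIDTH is the stack's data width):

// After a data notification: request `length` bytes of a session and
// consume the resulting packet.
template <int WIDTH>
void receive_example(hls::stream<appReadRequest>& rxDataReq,
                     hls::stream<ap_uint<16> >& rxDataRspMeta,
                     hls::stream<net_axis<WIDTH> >& rxDataRsp,
                     ap_uint<16> sessionID,
                     ap_uint<16> length)
{
	appReadRequest req;
	req.sessionID = sessionID;
	req.length    = length;
	rxDataReq.write(req);

	ap_uint<16> session = rxDataRspMeta.read(); // echoes the sessionID

	net_axis<WIDTH> word;
	do {
		word = rxDataRsp.read();
		// process word.data here; word.keep marks the valid bytes
	} while (!word.last);
}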

Waveform of receiving a (data) notification, requesting data, and receiving the data:

[waveform: tcp-rx-handshake]

Transmitting data

When an application wants to transmit data on a TCP connection, it first has to check if enough buffer space is available. This check/request is done through the s_axis_tx_data_req_metadata interface. If the response from the TCP stack through the m_axis_tx_data_rsp interface is positive, the application can send the data through the s_axis_tx_data_req interface. If the response is negative, the application can retry by sending another request on the s_axis_tx_data_req_metadata interface.

Interface definition in HLS:

struct appTxMeta
{
	ap_uint<16> sessionID;
	ap_uint<16> length;
};
struct appTxRsp
{
	ap_uint<16> sessionID;
	ap_uint<16> length;
	ap_uint<30> remaining_space;
	ap_uint<2>  error;
};

hls::stream<appTxMeta>& txDataReqMeta,
hls::stream<appTxRsp>& txDataRsp,
hls::stream<net_axis<WIDTH> >& txDataReq,
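A hypothetical transmit path following this request/response/retry protocol (the function name and the payload stream are ours; we assume error == 0 signals a positive response):

// Reserve buffer space for `length` bytes on a session and, if the
// stack grants it, stream the payload; otherwise the metadata request
// can simply be re-issued later.
template <int WIDTH>
void transmit_example(hls::stream<appTxMeta>& txDataReqMeta,
                      hls::stream<appTxRsp>& txDataRsp,
                      hls::stream<net_axis<WIDTH> >& txDataReq,
                      hls::stream<net_axis<WIDTH> >& payload,
                      ap_uint<16> sessionID,
                      ap_uint<16> length)
{
	appTxMeta meta;
	meta.sessionID = sessionID;
	meta.length    = length;
	txDataReqMeta.write(meta);

	appTxRsp rsp = txDataRsp.read();
	if (rsp.error == 0) {
		// positive response: enough buffer space, forward the payload
		net_axis<WIDTH> word;
		do {
			word = payload.read();
			txDataReq.write(word);
		} while (!word.last);
	}
	// negative response: retry with another appTxMeta request later
}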

Waveform of requesting a data transmit and transmitting the data: [waveform: tcp-tx-handshake]

RoCE (RDMA over Converged Ethernet)

Load Queue Pair (QP)

Before any RDMA operations can be executed, the Queue Pairs have to be established out-of-band (e.g. over TCP/IP) by the hosts. The host can then load the QP into the RoCE stack through the s_axis_qp_interface and s_axis_qp_conn_interface interfaces.

Interface definition in HLS:

typedef enum {RESET, INIT, READY_RECV, READY_SEND, SQ_ERROR, ERROR} qpState;

struct qpContext
{
	qpState		newState;
	ap_uint<24> qp_num;
	ap_uint<24> remote_psn;
	ap_uint<24> local_psn;
	ap_uint<16> r_key;
	ap_uint<48> virtual_address;
};
struct ifConnReq
{
	ap_uint<16> qpn;
	ap_uint<24> remote_qpn;
	ap_uint<128> remote_ip_address;
	ap_uint<16> remote_udp_port;
};

hls::stream<qpContext>&	s_axis_qp_interface,
hls::stream<ifConnReq>&	s_axis_qp_conn_interface,
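A hypothetical sketch of loading one QP after the out-of-band exchange; all field values are illustrative placeholders (4791 is the standard RoCEv2 UDP port):

// Load one QP context and its connection information into the stack.
void load_qp_example(hls::stream<qpContext>& s_axis_qp_interface,
                     hls::stream<ifConnReq>& s_axis_qp_conn_interface)
{
	qpContext ctx;
	ctx.newState        = READY_RECV;
	ctx.qp_num          = 0x11;       // example QP number
	ctx.remote_psn      = 0xABC123;   // packet sequence numbers agreed
	ctx.local_psn       = 0xDEF456;   // on out-of-band with the peer
	ctx.r_key           = 0x1234;     // remote key for RDMA accesses
	ctx.virtual_address = 0x10000000; // base of the registered buffer
	s_axis_qp_interface.write(ctx);

	ifConnReq conn;
	conn.qpn               = 0x11;
	conn.remote_qpn        = 0x22;
	conn.remote_ip_address = 0xC0A80102; // peer address (field fits IPv6)
	conn.remote_udp_port   = 4791;       // standard RoCEv2 UDP port
	s_axis_qp_conn_interface.write(conn);
}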

Issue RDMA commands

RDMA commands can be issued to the RoCE stack through the s_axis_tx_meta interface. In case the command transmits data, this data can either originate from host memory, as specified by local_vaddr, or from the application on the FPGA. In the latter case, local_vaddr is set to 0 and the data is provided through the s_axis_tx_data interface.

Interface definition in HLS:

typedef enum {APP_READ, APP_WRITE, APP_PART, APP_POINTER, APP_READ_CONSISTENT} appOpCode;

struct txMeta
{
	appOpCode 	op_code;
	ap_uint<24> qpn;
	ap_uint<48> local_vaddr;
	ap_uint<48> remote_vaddr;
	ap_uint<32> length;
};
hls::stream<txMeta>& s_axis_tx_meta,
hls::stream<net_axis<WIDTH> >& s_axis_tx_data,
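For example, a hypothetical sketch issuing an RDMA write whose 64-byte payload comes from the FPGA application, so local_vaddr is set to 0 (the function name and the payload stream are ours):

// Issue an RDMA write of 64 bytes of FPGA-generated data.
template <int WIDTH>
void rdma_write_example(hls::stream<txMeta>& s_axis_tx_meta,
                        hls::stream<net_axis<WIDTH> >& s_axis_tx_data,
                        hls::stream<net_axis<WIDTH> >& payload)
{
	txMeta cmd;
	cmd.op_code      = APP_WRITE;
	cmd.qpn          = 0x11;       // target queue pair (example)
	cmd.local_vaddr  = 0;          // 0: payload comes from s_axis_tx_data
	cmd.remote_vaddr = 0x10000000; // destination in remote memory (example)
	cmd.length       = 64;         // bytes to transfer
	s_axis_tx_meta.write(cmd);

	// stream the payload itself, beat by beat
	net_axis<WIDTH> word;
	do {
		word = payload.read();
		s_axis_tx_data.write(word);
	} while (!word.last);
}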

Waveform of issuing an RDMA read request: [waveform: roce-read-handshake]

Waveform of issuing an RDMA write request where data on the FPGA is transmitted: [waveform: roce-write-handshake]

Benchmarks

(Coming soon)

Publications

  • D. Sidler, G. Alonso, M. Blott, K. Karras et al., Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware, in FCCM'15, Paper, Slides

  • D. Sidler, Z. Istvan, G. Alonso, Low-Latency TCP/IP Stack for Data Center Applications, in FPL'16, Paper

  • D. Sidler, Z. Wang, M. Chiosa, A. Kulkarni, G. Alonso, StRoM: smart remote memory, in EuroSys'20, Paper

Citation

If you use the TCP/IP stack or the RDMA stack in your project, please cite one of the following papers and/or link to the GitHub project:

@INPROCEEDINGS{sidler2015tcp, 
	author={D. Sidler and G. Alonso and M. Blott and K. Karras and others}, 
	booktitle={FCCM'15}, 
	title={{Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware}}, 
}
@INPROCEEDINGS{sidler2016lowlatencytcp, 
	author={D. Sidler and Z. Istvan and G. Alonso}, 
	booktitle={FPL'16}, 
	title={{Low-Latency TCP/IP Stack for Data Center Applications}}, 
}
@PHDTHESIS{sidler2019innetworkdataprocessing,
	author = {Sidler, David},
	school = {ETH Zurich},
	year = {2019},
	copyright = {In Copyright - Non-Commercial Use Permitted},
	title = {In-Network Data Processing using FPGAs},
}
@INPROCEEDINGS{sidler2020strom,
	author = {Sidler, David and Wang, Zeke and Chiosa, Monica and Kulkarni, Amit and Alonso, Gustavo},
	booktitle = {Proceedings of the Fifteenth European Conference on Computer Systems},
	title = {StRoM: Smart Remote Memory},
	doi = {10.1145/3342195.3387519},
}

Contributors
