DiSNI: Direct Storage and Networking Interface

DiSNI is a Java library for direct storage and networking access from userspace. It provides an RDMA interface to access remote memory. DiSNI enables the development of Java applications for high performance RDMA networks, such as InfiniBand, iWARP, or RoCE. The RDMA API is implemented based on the Open Fabrics Enterprise Distribution (OFED) RDMA user libraries. It provides RDMA semantics including asynchronous operations, zero-copy transmission and direct data placement.

Changelog

  • Version 1.5 removes the NVMf/SPDK code. For users of the DiSNI NVMf API, we provide a new NVMf library called jNVMf.

Building DiSNI

Building the source requires Apache Maven, GNU Autotools, and Java version 8 or higher. To build DiSNI and its example programs, obtain a copy of DiSNI from GitHub and execute the following steps:

  1. Compile the Java sources using: mvn -DskipTests install
  2. Compile libdisni using: cd libdisni; ./autoprepare.sh; ./configure --with-jdk=<path>; make install

How to Run the Examples

Common steps:

  1. After building DiSNI, make sure DiSNI and its dependencies are in the classpath (e.g., disni-1.5-jar-with-dependencies.jar). Also add the DiSNI test jar (disni-1.5-tests.jar) which includes the examples.
  2. Make sure libdisni is part of the LD_LIBRARY_PATH.
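
One way to sanity-check step 2 before launching the examples is a small Java program that scans LD_LIBRARY_PATH for the native library. This is a minimal sketch and not part of DiSNI: the class name LibdisniPathCheck is hypothetical, and the file name libdisni.so is an assumption about what the libdisni build installs on Linux.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class LibdisniPathCheck {
    // Assumption: the shared object installed by the libdisni build is called libdisni.so.
    static final String LIB_FILE = "libdisni.so";

    /** Returns the full path of every match of fileName in the directories of searchPath. */
    static List<Path> findLibrary(String searchPath, String fileName) {
        List<Path> hits = new ArrayList<>();
        if (searchPath == null || searchPath.isEmpty()) {
            return hits;
        }
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Paths.get(dir, fileName);
            // isRegularFile follows symbolic links, which shared libraries often are.
            if (Files.isRegularFile(candidate)) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Path> hits = findLibrary(System.getenv("LD_LIBRARY_PATH"), LIB_FILE);
        if (hits.isEmpty()) {
            System.out.println(LIB_FILE + " not found in LD_LIBRARY_PATH");
        } else {
            System.out.println("found " + hits.get(0));
        }
    }
}
```

If the library is reported as missing, add the directory that contains it to LD_LIBRARY_PATH before starting the JVM.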

RDMA example

  1. Make sure the RDMA network interface is configured and up on the test machines (run ibv_devices to see the list of RDMA NICs). If your machine does not have RDMA hardware, you can also use SoftiWARP, which is available on GitHub.
  2. Run the server: java com.ibm.disni.examples.ReadServer -a <server IP>
  3. Run the client: java com.ibm.disni.examples.ReadClient -a <server IP>

Programming with DiSNI

DiSNI is available on Maven Central, so the simplest way to use it in a Maven application is to add the following snippet to the application's pom.xml file:

<dependency>
  <groupId>com.ibm.disni</groupId>
  <artifactId>disni</artifactId>
  <version>1.5</version>
</dependency>

The DiSNI API follows a Group/Endpoint model which is based on three key data types (interfaces):

  • DiSNIServerEndpoint:
    • represents a listening server waiting for new connections
    • contains methods to bind() to a specific port and to accept() new connections
  • DiSNIEndpoint:
    • represents a connection to a remote (or local) resource (e.g., RDMA)
    • offers non-blocking methods to read() or write() the resource
  • DiSNIGroup:
    • a container and a factory for both client and server endpoints

Specific implementations of these interfaces, such as DiSNI/RDMA, offer extra functionality tailored to their purpose.
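
The division of roles in the Group/Endpoint model can be illustrated with a toy, in-process model in plain Java. This is not the DiSNI API: every class and method name below is hypothetical, connections are simulated with in-memory objects, and the read()/write() data path is omitted. The sketch only shows a group acting as factory and container, a server endpoint that binds and accepts, and endpoints that represent connections.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

final class GroupEndpointSketch {
    /** Hypothetical connection endpoint; a real one would wrap an RDMA connection. */
    static final class Endpoint {
        final String name;
        Endpoint peer;

        Endpoint(String name) {
            this.name = name;
        }
    }

    /** Hypothetical listening endpoint: bind() to a port, accept() queued connections. */
    static final class ServerEndpoint {
        private final Group group;
        private final Queue<Endpoint> pending = new ArrayDeque<>();

        ServerEndpoint(Group group) {
            this.group = group;
        }

        void bind(int port) {
            group.listeners.put(port, this);
        }

        /** Returns the next pending connection, or null if none is waiting. */
        Endpoint accept() {
            return pending.poll();
        }
    }

    /** Hypothetical group: creates endpoints (factory) and keeps track of them (container). */
    static final class Group {
        final Map<Integer, ServerEndpoint> listeners = new HashMap<>();
        final List<Endpoint> endpoints = new ArrayList<>();

        ServerEndpoint createServerEndpoint() {
            return new ServerEndpoint(this);
        }

        Endpoint createEndpoint(String name) {
            Endpoint ep = new Endpoint(name);
            endpoints.add(ep);
            return ep;
        }

        /** Connects a client endpoint to whatever server endpoint is bound to the port. */
        void connect(Endpoint client, int port) {
            ServerEndpoint server = listeners.get(port);
            if (server == null) {
                throw new IllegalStateException("nothing bound to port " + port);
            }
            Endpoint serverSide = createEndpoint("server side of " + client.name);
            client.peer = serverSide;
            serverSide.peer = client;
            server.pending.add(serverSide);
        }
    }

    public static void main(String[] args) {
        Group group = new Group();
        ServerEndpoint server = group.createServerEndpoint();
        server.bind(1919);

        Endpoint client = group.createEndpoint("client");
        group.connect(client, 1919);

        Endpoint accepted = server.accept();
        System.out.println(client.name + " <-> " + accepted.name);
        System.out.println("endpoints tracked by group: " + group.endpoints.size());
    }
}
```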

Stateful Operations

To avoid the performance cost of passing complex parameters and arrays through the JNI interface, the DiSNI library implements stateful method calls (SMC). With this approach, the JNI serialization state for a particular call is cached in the context of an SMC object and can be reused many times. SMC objects can also be modified, for instance when transmitting data at different offsets. Modifications to SMC objects are efficient because they do not require serialization. It is key that SMC objects are reused whenever possible to avoid garbage collection overheads.
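
The idea behind SMC can be modeled in plain Java without any native code. The sketch below is not DiSNI's implementation: CachedCall and its methods are hypothetical names, the 12-byte field layout is invented for illustration, and execute() merely reads the cached buffer where a real SMC object would hand it to native code. It shows the pattern described above: encode the call once, patch single fields in place, and execute the same object repeatedly without allocating.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class StatefulCallSketch {
    /** Hypothetical stand-in for an SMC object: the request is encoded once and then reused. */
    static final class CachedCall {
        private static final int OFFSET_POS = 0; // 8-byte offset field
        private static final int LENGTH_POS = 8; // 4-byte length field

        // Direct buffer, as native code could read it without an extra copy.
        private final ByteBuffer encoded =
                ByteBuffer.allocateDirect(12).order(ByteOrder.nativeOrder());
        private int executions;

        CachedCall(long offset, int length) {
            // One-time "serialization" of the call parameters.
            encoded.putLong(OFFSET_POS, offset);
            encoded.putInt(LENGTH_POS, length);
        }

        /** In-place modification: overwrites one field and allocates nothing. */
        CachedCall setOffset(long offset) {
            encoded.putLong(OFFSET_POS, offset);
            return this;
        }

        /** Stand-in for the native call; returns the end offset of the described transfer. */
        long execute() {
            executions++;
            return encoded.getLong(OFFSET_POS) + encoded.getInt(LENGTH_POS);
        }

        int executions() {
            return executions;
        }
    }

    public static void main(String[] args) {
        CachedCall call = new CachedCall(0, 64);
        for (long offset = 0; offset < 256; offset += 64) {
            long end = call.setOffset(offset).execute();
            System.out.println("transfer covers bytes [" + offset + ", " + end + ")");
        }
        System.out.println("executions with one cached object: " + call.executions());
    }
}
```

Creating a new CachedCall inside the loop would give the same results but would serialize and allocate on every iteration, which is the overhead SMC is meant to avoid.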

Programming RDMA using DiSNI

Here are the basic steps necessary to develop an RDMA client/server application using DiSNI. First, define your own custom endpoints by extending either RdmaClientEndpoint or RdmaActiveClientEndpoint:

public class CustomServerEndpoint extends RdmaActiveClientEndpoint {
	public void init() throws IOException{
		super.init();
		//allocate and register buffers
		//initiate postRecv call to pre-post some recvs if necessary
		//...
	}
}

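Implement a factory for your custom endpoints:

public class CustomFactory implements RdmaEndpointFactory<CustomServerEndpoint> {
	private RdmaActiveEndpointGroup<CustomServerEndpoint> endpointGroup;

	//the group is passed in so that new endpoints can be attached to it
	public CustomFactory(RdmaActiveEndpointGroup<CustomServerEndpoint> endpointGroup) {
		this.endpointGroup = endpointGroup;
	}

	public CustomServerEndpoint createClientEndpoint(RdmaCmId idPriv) throws IOException {
		return new CustomServerEndpoint(endpointGroup, idPriv);
	}
}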

At the server, allocate an endpoint group and initialize it with the factory, then create a server endpoint, bind it, and accept connections:

RdmaActiveEndpointGroup<CustomServerEndpoint> endpointGroup = new RdmaActiveEndpointGroup<CustomServerEndpoint>();
CustomFactory factory = new CustomFactory(endpointGroup);
endpointGroup.init(factory);
RdmaServerEndpoint<CustomServerEndpoint> serverEndpoint = endpointGroup.createServerEndpoint();
serverEndpoint.bind(address);
CustomServerEndpoint endpoint = serverEndpoint.accept();

At the client, also create a custom endpoint and factory (not shown), and connect your endpoint to the server:

RdmaActiveEndpointGroup endpointGroup = new RdmaActiveEndpointGroup<CustomClientEndpoint>();
CustomFactory factory = new CustomFactory(endpointGroup);
endpointGroup.init(factory);
CustomClientEndpoint endpoint = endpointGroup.createClientEndpoint();
endpoint.connect(address);

Once an endpoint is connected, RDMA data operations can be issued. For this, a descriptor that encodes the operation will have to be prepared. The descriptor encodes the type of operation (read, write, send, recv) and points to the data buffer that is involved.

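//register the data buffer with the RDMA device
IbvMr mr = endpoint.registerMemory(buffer).execute().free();
IbvSendWR sendWR = endpoint.getSendWR();
sendWR.setOpcode(IbvSendWR.IBV_WR_RDMA_READ);
sendWR.getRdma().setRemote_addr(mr.getAddr());
//lkey is not defined in this snippet; for an RDMA read it must be the key of the remote buffer, obtained from the peer beforehand
sendWR.getRdma().setRkey(lkey);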

To trigger the operation, a list of descriptors will have to be posted onto the connection. Each descriptor may further have multiple scatter/gather elements.

SVCPostSend postSend = endpoint.postSend(descriptorList);
postSend.execute();

A completion event is created by the network interface after the data buffer has been transferred to the NIC via DMA. Depending on the type of endpoint group used, the event is either signaled through a callback or has to be polled manually by the application. Once the completion event has been consumed, the data buffer can be reused.

A good example showcasing the use of SMC can be found in JVerbsReadClient.java:

SVCPostSend postSend = endpoint.postSend(endpoint.getWrList_send());
for (int i = 10; i <= 100; i += 10){
	postSend.getWrMod(0).getSgeMod(0).setLength(i);
	postSend.execute();
	//wait until the operation has completed
	endpoint.getWcEvents().take();

	//we should have the content of the remote buffer in our own local buffer now
	ByteBuffer dataBuf = endpoint.getDataBuf();
	dataBuf.clear();
	System.out.println("ReadClient::read memory from server: " + dataBuf.asCharBuffer().toString());
}
postSend.free();

Choosing the EndpointGroup

As mentioned earlier, EndpointGroups are containers and factories for RDMA connections (RdmaEndpoint). There are two types of groups available in the RDMA API, and which type works best depends on the application.

The RdmaActiveEndpointGroup actively processes network events caused by RDMA messages being transmitted or received. Events are signaled by calling dispatchCqEvent(), which can be overridden by the custom endpoint of the application. The RdmaPassiveEndpointGroup provides a polling interface that allows the application to directly reap completion events from the network queue (completion queue).

The passive mode therefore typically has lower latency but may suffer from contention when a large number of threads operate on the same connection. The active mode, on the other hand, is more robust under large numbers of threads but has higher latency. Often the best option is to use active endpoints at the server and passive endpoints at the client. Passive endpoints are typically the right choice if the application knows when messages will be received and can thus poll the completion queue accordingly.
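
The difference between the two modes can be sketched with standard Java concurrency primitives. The code below is a toy model and does not use DiSNI: the completion queue is an ordinary BlockingQueue of integer IDs, and the names pollPassive and startActive are hypothetical. In the passive variant the calling thread reaps completions itself, while in the active variant a dispatcher thread takes them from the queue and invokes a callback, which is roughly the role dispatchCqEvent() plays for active endpoints.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.IntConsumer;

final class CompletionModesSketch {
    /** Passive style: the application thread polls the completion queue itself. */
    static List<Integer> pollPassive(BlockingQueue<Integer> cq, int expected) {
        List<Integer> seen = new ArrayList<>();
        while (seen.size() < expected) {
            Integer completion = cq.poll(); // non-blocking; null when the queue is empty
            if (completion != null) {
                seen.add(completion);
            }
        }
        return seen;
    }

    /** Active style: a dispatcher thread waits for completions and calls back into the application. */
    static Thread startActive(BlockingQueue<Integer> cq, int expected, IntConsumer callback) {
        Thread dispatcher = new Thread(() -> {
            try {
                for (int i = 0; i < expected; i++) {
                    callback.accept(cq.take()); // blocks until a completion arrives
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "cq-dispatcher");
        dispatcher.start();
        return dispatcher;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> passiveCq = new LinkedBlockingQueue<>();
        for (int id = 1; id <= 3; id++) {
            passiveCq.add(id);
        }
        System.out.println("passive poll saw " + pollPassive(passiveCq, 3));

        BlockingQueue<Integer> activeCq = new LinkedBlockingQueue<>();
        List<Integer> delivered = Collections.synchronizedList(new ArrayList<>());
        Thread dispatcher = startActive(activeCq, 3, id -> delivered.add(id));
        for (int id = 1; id <= 3; id++) {
            activeCq.add(id);
        }
        dispatcher.join();
        System.out.println("active callback saw " + delivered);
    }
}
```

The passive loop avoids the hand-off to another thread, which is where its latency advantage comes from, but several threads spinning on the same queue would compete for it. The active variant pays for the extra thread hop and in return serializes event delivery through a single dispatcher.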

Publications

Contributions

PRs are always welcome. Please fork the repository, make the modifications you propose, and let us know.

Contact

If you have questions or suggestions, feel free to post at:

https://groups.google.com/forum/#!forum/zrlio-users

or email: [email protected]
