• This repository has been archived on 01/Jul/2023
  • Stars
    star
    247
  • Rank 164,117 (Top 4 %)
  • Language
    C++
  • License
    Other
  • Created about 5 years ago
  • Updated almost 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A tensor-aware point-to-point communication primitive for machine learning

TensorPipe

The TensorPipe project provides a tensor-aware channel to transfer rich objects from one process to another while using the fastest transport for the tensors contained therein (e.g., CUDA device-to-device copy).

Getting started

First clone the repository:

$ git clone --recursive https://github.com/pytorch/tensorpipe

Then, build as follows (using ninja instead of make):

$ cd tensorpipe
$ mkdir build
$ cd build
$ cmake ../ -GNinja
$ ninja

You can find test executables in build/tensorpipe/test.

Interface

There are four classes you need to know about:

  • tensorpipe::Context, which keeps track of the global state of the system, such as thread pools, open file descriptors, etc.
  • tensorpipe::Listener, which allows one process to open an entry point for other processes to connect to.
  • tensorpipe::Pipe, the one communication primitive that this entire project is about. You can obtain one either by connecting to the listener of another process or from such a listener when another process connects to it. Once you have a pipe, you can send messages on it, and that's the whole point.
  • tensorpipe::Message, which is the the language that pipes read and write in. Pipes are streams of structured messages (not just raw byte buffers), and a message is composed of a "core" payload (memory living on CPU) plus a list of tensors (memory living on any device, like GPUs).

Sending a message from one end of the pipe to the other can be achieved using the write method, which takes a message (with the data to send) and a callback which will be invoked once the sending has completed. This callback will be invoked with an error (if one happened) and with the message.

Receiving a message takes two steps: on an incoming message, first the pipe asks you to provide some memory to hold the message in, and then you ask the pipe to read the data into that memory. In order to do this, first you must register a callback that will be notified for incoming messages. This is performed by calling the readDescriptor method with said callback. The callback will be invoked with a so-called descriptor, which can be seen as a "message skeleton", i.e., a message with no buffers attached to it (they are set to null pointers). The job of this callback is filling in those buffers, either by allocating the required memory or by obtaining it from somewhere else (from a cache, as a slice of a batch that's being assembled, ...). This descriptor also contains some metadata, given by the sender, which can be used to provide allocation hints or any other information that can help the receiver determine where to store the data. Once the message's buffers are ready, you can tell the pipe to go ahead and fill them in with the incoming data by passing the message to the read method, together with a callback which will be called when all the data has been received and stored. As when writing, this callback will be given a (possibly empty) error and the original message. The readDescriptor callback is one-shot, which means that after it fires it "expires" and will not be called again. It must be re-armed for a new event to be received.

When you pass a message to the pipe, to send it or to receive into it, you must not tamper with the underlying memory until the callback has completed, even if the write or read call already returned. (The write and read calls, and all other calls, are non-blocking so that it's easier to schedule asynchronous parallel trasfers without having to use threads). This means you can not deallocate the memory or alter it in any way, as the pipe may still be reading or modifying it. In other terms, you relinquish control over the memory when you pass a message to the pipe, only to reacquire it once the message is given back to you in the callback. This contract is encoded by the requirement to move the messages into and out of the pipe (using rvalue references). Also, because of this agreement, all callbacks will always be called, even if the pipe is closed or if it errors, in order to give back the memory.

The order in which messages are written to a pipe is preserved when these messages are read on the other side. Moreover, for a given pipe endpoint, the callbacks of the performed operations are executed in the same order that these operations were scheduled, even if the operations are performed asynchronously or out-of-band and thus may overlap or occur out of order. What this means is that if two write operations are scheduled one after the other back-to-back, even if the second one completes before the first one, its callback is delayed until the first one also completes and its callback is invoked. The same applies for reads. All the callbacks of all the pipes in a given context are called from the same per-context thread and thus no two callbacks will occur at the same time. However, different contexts will use different threads and their callbacks may thus overlap.

All the callbacks are invoked with an error reference. This may be "empty", i.e., indicate that no error has in fact occurred. In this case, the error object evaluates to false. In case of an actual error it will instead evaluate to true. When invoked with an error, the remaining arguments of the callback may be meaningless. For the read and write callbacks they will still contain the message that these methods will be invoked with, but the readDescriptor one will be an empty or invalid message. It should not be used.

There is no expectation for the readDescriptor callback to be armed at all times. Similarly, it is not necessary to call the read method immediately after a descriptor has been read. Both these possibilities are by design, in order to allow the user of the pipe to apply some backpressure in case it's receiving messages at a faster rate than it can handle, or for any other reason. This backpressure will be propagated to the lower-level components as as far down as possible (e.g., by stopping listening for readability events on the socket file descriptor).

Transports and channels

TensorPipe aims to be "backend-agnostic": it doesn't want to be restricted to a single way of copying data around but wants to be able to choose the fastest medium from a library of backends, based on the circumstances (e.g., are the two processes on the same machine?) and on the available hardware (e.g., are the GPUs connected with NVLink?). TensorPipe strives to have the largest selection of backends, enabling users to implement specific backends for their systems (should the default ones prove limited) and encouraging contributions.

The two processes that are establishing a pipe will automatically negotiate during setup to determine which of the backends they have at their disposal can be used and how well they would perform, in order to choose the best one in a way that is completely transparent to the user.

Backends come in two flavors:

  • Transports are the connections used by the pipes to transfer control messages, and the (smallish) core payloads. They are meant to be lightweight and low-latency. The most basic transport is a simple TCP one, which should work in all scenarios. A more optimized one, for example, is based on a ring buffer allocated in shared memory, which two processes on the same machine can use to communicate by performing just a memory copy, without passing through the kernel.

  • Channels are where the heavy lifting takes place, as they take care of copying the (larger) tensor data. High bandwidths are a requirement. Examples include multiplexing chunks of data across multiple TCP sockets and processes, so to saturate the NIC's bandwidth. Or using a CUDA memcpy call to transfer memory from one GPU to another using NVLink.

These different usage patterns promote different design choices when implementing transports and channels, which means the two are not perfectly interchangeable. For example, a TCP-based transport is best implemented using a single connection, whereas a TCP-based channel will benefit from using multiple connection and chunk and multiplex the payload over them in order to saturate the bandwidth even on the most powerful NICs.

Moreover, the APIs of transports and channels put different constraints on them, which demand and permit different approaches. As a rule of thumb, we require more from the transports: the only out-of-band information they can use is a simple address, which is all they can use to bootstrap the connection, and they need to include some "signaling" capabilities (a write on one side "wakes up" the other side by causing a read). Channels, on the other hand, have much looser requirements: they basically just need to implement a memcpy and, for anything beyond that, they can leverage a transport that the pipe gives to them for support.

License

TensorPipe is BSD licensed, as found in the LICENSE.txt file.

More Repositories

1

pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
Python
83,553
star
2

examples

A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc.
Python
22,311
star
3

vision

Datasets, Transforms and Models specific to Computer Vision
Python
15,925
star
4

tutorials

PyTorch tutorials.
Jupyter Notebook
8,075
star
5

captum

Model interpretability and understanding for PyTorch
Python
4,781
star
6

ignite

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
Python
4,507
star
7

serve

Serve, optimize and scale PyTorch models in production
Java
4,190
star
8

torchtune

PyTorch native finetuning library
Python
4,163
star
9

text

Models, data loaders and abstractions for language processing, powered by PyTorch
Python
3,490
star
10

ELF

ELF: a platform for game research with AlphaGoZero/AlphaZero reimplementation
C++
3,364
star
11

glow

Compiler for Neural Network hardware accelerators
C++
3,197
star
12

botorch

Bayesian optimization in PyTorch
Jupyter Notebook
3,043
star
13

torchchat

Run PyTorch LLMs locally on servers, desktop and mobile
Python
3,040
star
14

TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
Python
2,565
star
15

audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
Python
2,471
star
16

xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
C++
2,469
star
17

rl

A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.
Python
2,241
star
18

torchtitan

A native PyTorch Library for large model training
Python
2,130
star
19

executorch

On-device AI across mobile, embedded and edge for PyTorch
C++
1,954
star
20

torchrec

Pytorch domain library for recommendation systems
Python
1,852
star
21

opacus

Training PyTorch models with differential privacy
Jupyter Notebook
1,666
star
22

tnt

A lightweight library for PyTorch training tools and utilities
Python
1,650
star
23

QNNPACK

Quantized Neural Network PACKage - mobile-optimized implementation of quantized neural network operators
C
1,519
star
24

android-demo-app

PyTorch android examples of usage in applications
Java
1,460
star
25

functorch

functorch is JAX-like composable function transforms for PyTorch.
Jupyter Notebook
1,388
star
26

hub

Submission to https://pytorch.org/hub/
Python
1,384
star
27

FBGEMM

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/
C++
1,156
star
28

data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
Python
1,112
star
29

cpuinfo

CPU INFOrmation library (x86/x86-64/ARM/ARM64, Linux/Windows/Android/macOS/iOS)
C
989
star
30

torchdynamo

A Python-level JIT compiler designed to make unmodified PyTorch programs faster.
Python
989
star
31

extension-cpp

C++ extensions in PyTorch
Python
980
star
32

benchmark

TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.
Python
841
star
33

translate

Translate - a PyTorch Language Library
Python
820
star
34

tensordict

TensorDict is a pytorch dedicated tensor container.
Python
816
star
35

elastic

PyTorch elastic training
Python
728
star
36

PiPPy

Pipeline Parallelism for PyTorch
Python
698
star
37

kineto

A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.
HTML
682
star
38

torcharrow

High performance model preprocessing library on PyTorch
Python
641
star
39

ao

PyTorch native quantization and sparsity for training and inference
Python
630
star
40

ios-demo-app

PyTorch iOS examples
Swift
595
star
41

tvm

TVM integration into PyTorch
C++
451
star
42

contrib

Implementations of ideas from recent papers
Python
390
star
43

ort

Accelerate PyTorch models with ONNX Runtime
Python
353
star
44

builder

Continuous builder and binary build scripts for pytorch
Shell
325
star
45

torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
Python
319
star
46

accimage

high performance image loading and augmenting routines mimicking PIL.Image interface
C
317
star
47

extension-ffi

Examples of C extensions for PyTorch
Python
257
star
48

nestedtensor

[Prototype] Tools for the concurrent manipulation of variably sized Tensors.
Jupyter Notebook
252
star
49

pytorch.github.io

The website for PyTorch
HTML
222
star
50

torcheval

A library that contains a rich collection of performant PyTorch model metrics, a simple interface to create new metrics, a toolkit to facilitate metric computation in distributed training and tools for PyTorch model evaluations.
Python
210
star
51

cppdocs

PyTorch C++ API Documentation
HTML
206
star
52

workshops

This is a repository for all workshop related materials.
Jupyter Notebook
204
star
53

hydra-torch

Configuration classes enabling type-safe PyTorch configuration for Hydra apps
Python
199
star
54

multipy

torch::deploy (multipy for non-torch uses) is a system that lets you get around the GIL problem by running multiple Python interpreters in a single C++ process.
C++
169
star
55

torchsnapshot

A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind.
Python
142
star
56

java-demo

Jupyter Notebook
126
star
57

rfcs

PyTorch RFCs (experimental)
120
star
58

torchdistx

Torch Distributed Experimental
Python
115
star
59

extension-script

Example repository for custom C++/CUDA operators for TorchScript
Python
112
star
60

csprng

Cryptographically secure pseudorandom number generators for PyTorch
Batchfile
105
star
61

pytorch_sphinx_theme

PyTorch Sphinx Theme
CSS
94
star
62

test-infra

This repository hosts code that supports the testing infrastructure for the main PyTorch repo. For example, this repo hosts the logic to track disabled tests and slow tests, as well as our continuation integration jobs HUD/dashboard.
TypeScript
78
star
63

expecttest

Python
71
star
64

torchcodec

PyTorch video decoding
Python
46
star
65

maskedtensor

MaskedTensors for PyTorch
Python
38
star
66

add-annotations-github-action

A GitHub action to run clang-tidy and annotate failures
JavaScript
13
star
67

ci-hud

HUD for CI activity on `pytorch/pytorch`, provides a top level view for jobs to easily discern regressions
JavaScript
11
star
68

probot

PyTorch GitHub bot written in probot
TypeScript
11
star
69

ossci-job-dsl

Jenkins job definitions for OSSCI
Groovy
10
star
70

pytorch-integration-testing

Testing downstream libraries using pytorch release candidates
Makefile
6
star
71

docs

This repository is automatically generated to contain the website source for the PyTorch documentation at https//pytorch.org/docs.
HTML
4
star
72

torchhub_testing

Repo to test torchhub. Nothing to see here.
4
star
73

dr-ci

Diagnose and remediate CI jobs
Haskell
2
star
74

pytorch-ci-dockerfiles

Scripts for generating docker images for PyTorch CI
2
star
75

labeler-github-action

GitHub action for labeling issues and pull requests based on conditions
TypeScript
1
star