oneAPI Collective Communications Library (oneCCL)
oneAPI Collective Communications Library (oneCCL) provides an efficient implementation of communication patterns used in deep learning.
oneCCL is integrated into:
- Horovod* (distributed training framework). Refer to Horovod with oneCCL for details.
- PyTorch* (machine learning framework). Refer to PyTorch bindings for oneCCL for details.
oneCCL is part of oneAPI.
Prerequisites
- Ubuntu* 18
- GNU*: C, C++ 4.8.5 or higher.
Refer to System Requirements for more details.
SYCL support
Intel(R) oneAPI DPC++/C++ Compiler with Level Zero v1.0 support.
To install Level Zero, refer to the instructions in Intel(R) Graphics Compute Runtime repository or to the installation guide for oneAPI users.
BF16 support
- AVX512F-based implementation requires GCC 4.9 or higher.
- AVX512_BF16-based implementation requires GCC 10.0 or higher and GNU binutils 2.33 or higher.
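The version thresholds above can be checked before building. The following is a minimal sketch, not part of oneCCL: the helper names `version_ge` and `bf16_support` are hypothetical, and the check covers only the toolchain versions listed above, not whether the CPU supports the instructions.

```shell
# version_ge A B: succeed when dotted version A is >= dotted version B.
# Fields are compared numerically; missing fields count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# bf16_support GCC_VERSION BINUTILS_VERSION
# Prints which BF16 implementation the toolchain can build, using the
# thresholds from this README: avx512_bf16, avx512f, or none.
bf16_support() {
    gcc_ver=$1
    binutils_ver=$2
    if version_ge "$gcc_ver" 10.0 && version_ge "$binutils_ver" 2.33; then
        echo "avx512_bf16"
    elif version_ge "$gcc_ver" 4.9; then
        echo "avx512f"
    else
        echo "none"
    fi
}
```

One possible call is `bf16_support "$(gcc -dumpversion)" 2.33`, with the binutils version taken from the first line of `ld --version`.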
Installation
General installation scenario:
cd oneccl
mkdir build
cd build
cmake ..
make -j install
If you need a clean build, create a new build directory and invoke cmake within it.
You can also do the following during installation:
- Specify installation directory
- Specify the compiler
- Specify SYCL cross-platform abstraction level
- Specify the build type
- Enable make verbose output
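Several of these options map onto standard CMake variables. The sketch below assembles a configure command from them and prints it instead of running it. The helper name `configure_cmd` and its argument order are hypothetical. The SYCL option is left out because its name is specific to oneCCL and is not given in this README.

```shell
# configure_cmd PREFIX C_COMPILER CXX_COMPILER BUILD_TYPE
# Prints a cmake command line built from standard CMake variables.
# Empty arguments are skipped, so any subset of options can be given.
configure_cmd() {
    cmd="cmake .."
    if [ -n "${1:-}" ]; then cmd="$cmd -DCMAKE_INSTALL_PREFIX=$1"; fi
    if [ -n "${2:-}" ]; then cmd="$cmd -DCMAKE_C_COMPILER=$2"; fi
    if [ -n "${3:-}" ]; then cmd="$cmd -DCMAKE_CXX_COMPILER=$3"; fi
    if [ -n "${4:-}" ]; then cmd="$cmd -DCMAKE_BUILD_TYPE=$4"; fi
    echo "$cmd"
}

# Example: a Debug build installed under a custom directory.
configure_cmd "$HOME/oneccl_install" gcc g++ Debug
```

Run the printed command from the build directory. For verbose make output with CMake-generated Makefiles, follow it with `make -j VERBOSE=1 install`.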
Usage
Launching Example Application
Use the command:
$ source <install_dir>/env/setvars.sh
$ mpirun -n 2 <install_dir>/examples/benchmark/benchmark
Setting workers affinity
There are two ways to set worker threads (workers) affinity: automatically and explicitly.
Automatic setup
- Set the CCL_WORKER_COUNT environment variable to the desired number of workers per process.
- Set the CCL_WORKER_AFFINITY environment variable to the value auto.
Example:
export CCL_WORKER_COUNT=4
export CCL_WORKER_AFFINITY=auto
With the variables above, oneCCL creates four workers per process; how they are pinned depends on the process launcher.
If the application is launched using the mpirun provided with the oneCCL distribution package, the workers are automatically pinned to the last four cores available to the launched process. The exact IDs of the CPU cores can be controlled by mpirun parameters.
Otherwise, the workers are automatically pinned to the last four cores available on the node.
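To see which core IDs "the last four cores" refers to, the list can be computed by hand. This is an illustrative sketch only, not how oneCCL selects cores internally. The helper name `last_n_cores` is hypothetical, and the sketch assumes cores are numbered contiguously from 0 to TOTAL-1.

```shell
# last_n_cores COUNT TOTAL
# Prints the IDs of the last COUNT cores out of TOTAL as a
# comma-separated list, assuming cores are numbered 0..TOTAL-1.
last_n_cores() {
    count=$1
    total=$2
    if [ "$count" -gt "$total" ]; then
        echo "requested $count workers but only $total cores" >&2
        return 1
    fi
    ids=""
    i=$((total - count))
    while [ "$i" -lt "$total" ]; do
        ids="${ids:+$ids,}$i"   # append with a comma unless ids is empty
        i=$((i + 1))
    done
    echo "$ids"
}

# Example: four workers on an eight-core node.
last_n_cores 4 8
```

On Linux, the total can be taken from `nproc`, for example `last_n_cores "$CCL_WORKER_COUNT" "$(nproc)"`. The result has the same form as an explicit CCL_WORKER_AFFINITY value.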
Explicit setup
- Set the CCL_WORKER_COUNT environment variable to the desired number of workers per process.
- Set the CCL_WORKER_AFFINITY environment variable to the IDs of the cores to pin local workers to.
Example:
export CCL_WORKER_COUNT=4
export CCL_WORKER_AFFINITY=3,4,5,6
With the variables above, oneCCL will create four workers per process and pin them to the cores with the IDs of 3, 4, 5, and 6 respectively.
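The "respectively" above means the i-th worker goes to the i-th core in the list. A small sketch makes that mapping visible for a given value. The helper name `show_pinning` is hypothetical, and it handles only plain comma-separated IDs as in the example above.

```shell
# show_pinning LIST
# Prints the worker-to-core mapping implied by a comma-separated list
# of core IDs: the i-th worker is paired with the i-th entry.
show_pinning() {
    worker=0
    old_ifs=$IFS
    IFS=,                      # split the list on commas
    for core in $1; do
        echo "worker $worker -> core $core"
        worker=$((worker + 1))
    done
    IFS=$old_ifs               # restore the original field separator
}

# Example: the affinity value used above.
show_pinning 3,4,5,6
```

Comparing the number of printed lines with CCL_WORKER_COUNT is a quick way to spot a list that is too short or too long.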
Using oneCCL package from CMake
oneCCLConfig.cmake and oneCCLConfigVersion.cmake are included in the oneCCL distribution.
With these files, you can integrate oneCCL into a user project with the find_package command. A successful invocation of find_package(oneCCL <options>) creates the imported target oneCCL, which can be passed to the target_link_libraries command.
For example:
project(Foo)
add_executable(foo foo.cpp)
# Search for oneCCL
find_package(oneCCL REQUIRED)
# Connect oneCCL to foo
target_link_libraries(foo oneCCL)
oneCCLConfig files generation
To generate oneCCLConfig files for the oneCCL package, use the provided cmake/scripts/config_generation.cmake file:
cmake [-DOUTPUT_DIR=<output_dir>] -P cmake/scripts/config_generation.cmake
Additional Resources
Blog Posts
- Optimizing DLRM by using PyTorch with oneCCL Backend
- Intel MLSL Makes Distributed Training with MXNet Faster
Workshop Materials
- oneAPI, oneCCL and OFI: Path to Heterogeneous Architecture Programming with Scalable Collective Communications: recording and slides
Contribute
See CONTRIBUTING for more information.
License
Distributed under the Apache License 2.0. See LICENSE for more information.
Security Policy
See SECURITY for more information.