  • Stars: 199
  • Rank: 192,869 (top 4%)
  • Language: Scala
  • License: Apache License 2.0
  • Created: almost 7 years ago
  • Updated: about 1 year ago


Repository Details

Scientific computing with N-dimensional arrays

Compute.scala


Compute.scala is a Scala library for scientific computing with N-dimensional arrays in parallel on GPU, CPU and other devices. It will be the primary back-end of the upcoming DeepLearning.scala 3.0, addressing the performance problems we encountered with ND4J in DeepLearning.scala 2.0.

  • Compute.scala can dynamically merge multiple operators into one kernel program, which runs significantly faster when performing complex computations.
  • Compute.scala manages data buffers and other native resources deterministically, consuming less memory and reducing the performance impact of garbage collection.
  • All dimensional transformation operators (permute, broadcast, translate, etc.) in Compute.scala are views, requiring no additional data buffer allocation.
  • N-dimensional arrays in Compute.scala can be split into JVM collections, which support higher-order functions like map / reduce, and still run on GPU.
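To make the "views" idea concrete, here is a minimal plain-Scala sketch of an index-remapping view (an illustration with hypothetical names, not Compute.scala's actual internals): a permuted view shares the underlying buffer and only remaps indices, so no data is copied.

```scala
// Illustration only: a tiny index-remapping "view" over a flat row-major
// buffer, showing how permute can avoid copying data.
final class TensorView(buffer: Array[Float], physShape: Array[Int], order: Array[Int]) {
  // Size of each logical dimension after the permutation.
  def shape: Array[Int] = order.map(d => physShape(d))

  // Read one element: route logical indices back to physical positions.
  def apply(logical: Int*): Float = {
    val physical = new Array[Int](physShape.length)
    for (j <- order.indices) physical(order(j)) = logical(j)
    val flat = physical.zip(physShape).foldLeft(0) { case (acc, (i, dim)) => acc * dim + i }
    buffer(flat)
  }

  // Permuting composes index orders; the buffer is shared, never reallocated.
  def permute(perm: Array[Int]): TensorView =
    new TensorView(buffer, physShape, perm.map(j => order(j)))
}

val v = new TensorView(Array(1f, 2f, 3f, 4f, 5f, 6f), Array(2, 3), Array(0, 1))
val t = v.permute(Array(1, 0)) // logical 3×2 view over the same buffer
println(t(2, 0)) // == v(0, 2) == 3.0
```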

Getting started

System Requirements

Compute.scala is based on LWJGL 3's OpenCL binding, which supports AMD, NVIDIA and Intel GPUs and CPUs on Linux, Windows and macOS.

Make sure you have met the following system requirements before using Compute.scala.

  • Linux, Windows or macOS
  • JDK 8
  • OpenCL runtime

The performance of Compute.scala varies with the OpenCL runtime. For best performance, install the OpenCL runtime recommended in the following table.

                   Linux               Windows                  macOS
NVIDIA GPU         NVIDIA GPU Driver   NVIDIA GPU Driver        macOS's built-in OpenCL SDK
AMD GPU            AMDGPU-PRO Driver   AMD OpenCL™ 2.0 Driver   macOS's built-in OpenCL SDK
Intel or AMD CPU   POCL                POCL                     POCL

In particular, Compute.scala produces non-vectorized code, which relies on POCL's auto-vectorization feature for best performance when running on CPU.

Project setup

The artifacts of Compute.scala are published to the Maven Central repository for Scala 2.11 and 2.12. If you are using sbt, add the following settings to your build.sbt.

libraryDependencies += "com.thoughtworks.compute" %% "cpu" % "latest.release"

libraryDependencies += "com.thoughtworks.compute" %% "gpu" % "latest.release"

// LWJGL OpenCL library
libraryDependencies += "org.lwjgl" % "lwjgl-opencl" % "latest.release"

// Platform dependent runtime of LWJGL core library
libraryDependencies += ("org.lwjgl" % "lwjgl" % "latest.release").jar().classifier {
  import scala.util.Properties._
  if (isMac) {
    "natives-macos"
  } else if (isLinux) {
    "natives-linux"
  } else if (isWin) {
    "natives-windows"
  } else {
    throw new MessageOnlyException(s"lwjgl does not support $osName")
  }
}

Check Compute.scala on Scaladex and the LWJGL customize tool for settings for Maven, Gradle and other build tools.

Creating an N-dimensional array

Import the types in the gpu or cpu object, according to the OpenCL runtime you want to use.

// For N-dimensional array on GPU
import com.thoughtworks.compute.gpu._
// For N-dimensional array on CPU
import com.thoughtworks.compute.cpu._

In Compute.scala, an N-dimensional array is typed as Tensor, which can be created from Seq or Array.

val my2DArray: Tensor = Tensor(Array(Seq(1.0f, 2.0f, 3.0f), Seq(4.0f, 5.0f, 6.0f)))

If you print out my2DArray,

println(my2DArray)

then the output should be

[[1.0,2.0,3.0],[4.0,5.0,6.0]]

You can also print the size of each dimension using the shape method.

// Output 2 because my2DArray is a 2D array.
println(my2DArray.shape.length)

// Output 2 because the size of the first dimension of my2DArray is 2.
println(my2DArray.shape(0)) // 2

// Output 3 because the size of the second dimension of my2DArray is 3.
println(my2DArray.shape(1)) // 3

So my2DArray is a 2D array of size 2×3.

Scalar value

Note that a Tensor can be a zero dimensional array, which is simply a scalar value.

val scalar = Tensor(42.0f)
println(scalar.shape.length) // 0

Element-wise operators

Element-wise operators are performed on each element of the Tensor operands.

val plus100 = my2DArray + Tensor.fill(100.0f, Array(2, 3))

println(plus100) // Output [[101.0,102.0,103.0],[104.0,105.0,106.0]]

Design

Lazy-evaluation

Tensors in Compute.scala are immutable and lazily evaluated. All operators that create Tensors are pure: they allocate no data buffers and execute no time-consuming tasks. The actual computation is performed only when the final result is requested.

For example:

val a = Tensor(Seq(Seq(1.0f, 2.0f, 3.0f), Seq(4.0f, 5.0f, 6.0f)))
val b = Tensor(Seq(Seq(7.0f, 8.0f, 9.0f), Seq(10.0f, 11.0f, 12.0f)))
val c = Tensor(Seq(Seq(13.0f, 14.0f, 15.0f), Seq(16.0f, 17.0f, 18.0f)))

val result: InlineTensor = a * b + c

All the Tensors, including a, b, c and result, are small JVM objects, and no computation has been performed so far.

println(result.toString)

When result.toString is called, Compute.scala compiles the expression a * b + c into one kernel program and executes it.

Both result and the temporary expression a * b are InlineTensors, indicating that their computation can be inlined into a more complex kernel program. You can think of an InlineTensor as an @inline def method on the device side.

This approach is faster than other libraries' because we don't have to execute two separate kernels for the multiplication and the addition.
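The effect of kernel fusion can be sketched with plain JVM arrays standing in for device buffers (an analogy, not the generated OpenCL code): the unfused version makes two passes and allocates an intermediate buffer, while the fused version computes a * b + c in a single pass.

```scala
val a = Array(1f, 2f, 3f)
val b = Array(7f, 8f, 9f)
val c = Array(13f, 14f, 15f)

// Unfused: two passes, one intermediate buffer (like two kernel launches).
val ab = a.zip(b).map { case (x, y) => x * y }
val unfused = ab.zip(c).map { case (xy, z) => xy + z }

// Fused: one pass, no intermediate buffer (like one merged kernel).
val fused = a.indices.map(i => a(i) * b(i) + c(i)).toArray

println(fused.toSeq) // same result as unfused, with fewer passes over the data
```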

Check the Scaladoc to see which operators return InlineTensor or its subtype TransformedTensor, which can likewise be inlined into a more complex kernel program.

Caching

By default, when result.toString is called more than once, the expression a * b + c is executed more than once.

println(result.toString)

// The computation is performed again
println(result.toString)

Fortunately, we provide a doCache method that eagerly allocates a data buffer for a CachedTensor.

import com.thoughtworks.future._
import com.thoughtworks.raii.asynchronous._

val Resource(cachedTensor, releaseCache) = result.doCache.acquire.blockingAwait

try {
  // The cache is reused. No device-side computation is performed.
  println(cachedTensor.toString)

  // The cache is reused. No device-side computation is performed.
  println(cachedTensor.toString)

  val tmp: InlineTensor = exp(cachedTensor)
  
  // The cache for cachedTensor is reused, but the exponential function is performed.
  println(tmp.toString)

  // The cache for cachedTensor is reused, but the exponential function is performed, again.
  println(tmp.toString)
} finally {
  releaseCache.blockingAwait
}

// Crashes because the data buffer has been released
println(cachedTensor.toString)

The data buffer allocated for cachedTensor is kept until releaseCache is performed.

You can think of a CachedTensor as a lazy val on device side.
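The analogy can be made concrete in plain Scala, where a def recomputes on every use (like an InlineTensor) and a lazy val computes once and reuses the result (like a CachedTensor):

```scala
// Plain-Scala analogy for InlineTensor vs CachedTensor (JVM-side only).
var evaluations = 0
def inlined: Float = { evaluations += 1; 1f * 2f + 3f }     // recomputed on every use
lazy val cached: Float = { evaluations += 1; 1f * 2f + 3f } // computed once, then reused

val twiceInlined = inlined + inlined // two evaluations
val twiceCached = cached + cached    // one evaluation, then reuse
println(evaluations) // 3
```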

By combining pure Tensors with the impure doCache mechanism, we achieve the following goals:

  • All Tensors are pure. No data buffer is allocated when creating them.
  • The computation of Tensors can be merged together, minimizing the number of intermediate data buffers and kernel programs.
  • Developers can create caches for Tensors, as a deterministic way to manage the life-cycle of resources.

Mutable variables

Tensors are immutable, but you can create mutable variables of cached tensors to work around this limitation.

var Resource(weight, releaseWeight) = Tensor.random(Array(32, 32)).doCache.acquire.blockingAwait

while (true) {
  val Resource(newWeight, releaseNewWeight) = (weight * Tensor.random(Array(32, 32))).doCache.acquire.blockingAwait
  
  releaseWeight.blockingAwait
  
  weight = newWeight
  releaseWeight = releaseNewWeight
}

Use this approach with caution. doCache should only be used for long-lived data (e.g. the weights of a neural network); it is not designed for intermediate variables in a complex expression. A sophisticated Scala developer should be able to avoid var and while entirely in favor of recursive functions.
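As a sketch of that recursive style, with plain Floats standing in for cached tensors (the real code would also acquire and release the caches at each step, threading the release actions through the recursion):

```scala
import scala.annotation.tailrec

// Sketch: a tail-recursive training loop replacing the var/while version.
// Plain Floats stand in for cached tensors; the multiplication stands in
// for weight * Tensor.random(...).
@tailrec
def train(weight: Float, stepsLeft: Int): Float =
  if (stepsLeft == 0) weight
  else train(weight * 0.5f, stepsLeft - 1)

println(train(8.0f, 3)) // 1.0
```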

Scala collection interoperability

split

A Tensor can be split into smaller Tensors along a specific dimension.

For example, given a 3D tensor whose shape is 2×3×4,

val my3DTensor = Tensor((0.0f until 24.0f by 1.0f).grouped(4).toSeq.grouped(3).toSeq)

val Array(2, 3, 4) = my3DTensor.shape

when splitting it at dimension 0,

val subtensors0: Seq[Tensor] = my3DTensor.split(dimension = 0)

then the result is a Seq of two 3×4 tensors.

// Output: TensorSeq([[0.0,1.0,2.0,3.0],[4.0,5.0,6.0,7.0],[8.0,9.0,10.0,11.0]], [[12.0,13.0,14.0,15.0],[16.0,17.0,18.0,19.0],[20.0,21.0,22.0,23.0]])
println(subtensors0)

When splitting it at dimension 1,

val subtensors1: Seq[Tensor] = my3DTensor.split(dimension = 1)

then the result is a Seq of three 2×4 tensors.

// Output: TensorSeq([[0.0,1.0,2.0,3.0],[12.0,13.0,14.0,15.0]], [[4.0,5.0,6.0,7.0],[16.0,17.0,18.0,19.0]], [[8.0,9.0,10.0,11.0],[20.0,21.0,22.0,23.0]])
println(subtensors1)

Then you can use arbitrary Scala collection functions on the Seq of subtensors.
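As a plain-Scala reference for these semantics, using nested Seqs instead of the library's device buffers (an illustration, not the library's implementation): splitting at dimension 0 yields the outer elements, and splitting at dimension 1 amounts to transposing the first two dimensions.

```scala
// Reference semantics of split on nested Seqs (illustration only).
type Nested = Seq[Seq[Seq[Float]]] // a 3D "tensor"

def splitDim0(t: Nested): Seq[Seq[Seq[Float]]] = t           // the outer elements
def splitDim1(t: Nested): Seq[Seq[Seq[Float]]] = t.transpose // swap dimensions 0 and 1

// The same 2×3×4 tensor as above.
val t: Nested = (0 until 24).map(_.toFloat).grouped(4).toSeq.grouped(3).toSeq

println(splitDim0(t).length) // 2 subtensors of shape 3×4
println(splitDim1(t).length) // 3 subtensors of shape 2×4
```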

join

Multiple Tensors of the same shape can be merged into a larger Tensor via the Tensor.join function.

Given a Seq of three 2×2 Tensors,

val mySubtensors: Seq[Tensor] = Seq(
  Tensor(Seq(Seq(1.0f, 2.0f), Seq(3.0f, 4.0f))),
  Tensor(Seq(Seq(5.0f, 6.0f), Seq(7.0f, 8.0f))),
  Tensor(Seq(Seq(9.0f, 10.0f), Seq(11.0f, 12.0f)))
)

when joining them,

val merged: Tensor = Tensor.join(mySubtensors)

then the result is a 2×2×3 Tensor.

// Output: [[[1.0,5.0,9.0],[2.0,6.0,10.0]],[[3.0,7.0,11.0],[4.0,8.0,12.0]]]
println(merged.toString)

Generally, when joining n Tensors of shape a0 × a1 × a2 × ⋯ × ai, the shape of the resulting Tensor is a0 × a1 × a2 × ⋯ × ai × n.
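The shape rule can be written down as a small hypothetical helper (not part of the library's API):

```scala
// Hypothetical helper illustrating the shape rule for Tensor.join:
// joining n tensors of shape a0 × … × ai yields shape a0 × … × ai × n.
def joinedShape(shapes: Seq[Array[Int]]): Array[Int] = {
  require(shapes.nonEmpty && shapes.forall(_.sameElements(shapes.head)),
          "all joined tensors must have the same shape")
  shapes.head :+ shapes.length
}

// Three 2×2 shapes join into a 2×2×3 shape.
println(joinedShape(Seq(Array(2, 2), Array(2, 2), Array(2, 2))).mkString("×")) // 2×2×3
```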

Case study: fast matrix multiplication via split and join

By combining split and join, you can build complex computations in the following steps:

  1. Use split to create Seqs from some dimensions of Tensors.
  2. Use Scala collection functions to manipulate the Seqs.
  3. Use join to merge the transformed Seqs back into a Tensor.

For example, you can implement matrix multiplication in this style.

def matrixMultiply1(matrix1: Tensor, matrix2: Tensor): Tensor = {
  val columns1 = matrix1.split(1)
  val columns2 = matrix2.split(1)
  val resultColumns = columns2.map { column2: Tensor =>
    (columns1 zip column2.split(0))
      .map {
        case (l: Tensor, r: Tensor) =>
          l * r.broadcast(l.shape)
      }
      .reduce[Tensor](_ + _)
  }
  Tensor.join(resultColumns)
}

You can think of the Scala collection function calls as a code generator for the kernel program; the loops over Scala collections become unrolled loops in the kernel program.
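As a sanity check of the column-splitting scheme itself (run on plain Scala collections, not the kernel code), the same algorithm can be applied to nested Seqs, with transpose standing in for split(1) and for joining the result columns back into rows:

```scala
// Reference run of the column-splitting matrix multiplication on
// Seq[Seq[Float]] rows (illustration only, no GPU involved).
type Matrix = Seq[Seq[Float]] // row-major

def refMatMul(m1: Matrix, m2: Matrix): Matrix = {
  val columns1 = m1.transpose // columns of m1, like matrix1.split(1)
  val columns2 = m2.transpose // columns of m2, like matrix2.split(1)
  val resultColumns = columns2.map { column2 =>
    columns1.zip(column2).map { case (col1, scalar) =>
      col1.map(_ * scalar)    // like l * r.broadcast(l.shape)
    }.reduce((x, y) => x.zip(y).map { case (a, b) => a + b })
  }
  resultColumns.transpose     // like Tensor.join, back to rows
}

val m1: Matrix = Seq(Seq(1f, 2f), Seq(3f, 4f))
val m2: Matrix = Seq(Seq(5f, 6f), Seq(7f, 8f))
println(refMatMul(m1, m2)) // rows (19, 22) and (43, 50)
```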

The above matrixMultiply1 creates a kernel program that contains an unrolled loop over each row and column of matrix2. It therefore runs very fast when matrix1 is big and matrix2 is small. Our benchmark shows that matrixMultiply1 runs even faster than ND4J's cuBLAS back-end on a Titan X GPU when matrix1 is 65536×8 and matrix2 is 8×8.


You can also create another version of matrix multiplication, which only unrolls the loop over each row of matrix2.

def matrixMultiply2(matrix1: Tensor, matrix2: Tensor): Tensor = {
  val Array(i, j) = matrix1.shape
  val Array(`j`, k) = matrix2.shape
  val broadcastMatrix1 = matrix1.broadcast(Array(i, j, k))
  val broadcastMatrix2 = matrix2.reshape(Array(1, j, k)).broadcast(Array(i, j, k))
  val product = broadcastMatrix1 * broadcastMatrix2
  product.split(1).reduce[Tensor](_ + _)
}

matrixMultiply2 will run faster than matrixMultiply1 when matrix1 is small.

A sophisticated matrix multiplication should dynamically switch between the two implementations according to matrix size.

val UnrollThreshold = 4000

def matrixMultiply(matrix1: Tensor, matrix2: Tensor): Tensor = {
  if (matrix1.shape.head >= UnrollThreshold) {
    matrixMultiply1(matrix1, matrix2)
  } else {
    matrixMultiply2(matrix1, matrix2)
  }
}

The final version of matrixMultiply will have good performance for both small and big matrices.

Benchmark

We created some benchmarks comparing Compute.scala and ND4J, in an immutable style, on NVIDIA and AMD GPUs.

Some observations from the benchmark results:

  • Compute.scala supports both NVIDIA and AMD GPUs, while ND4J does not support AMD GPUs.
  • Compute.scala is faster than ND4J when performing complex expressions.
  • Compute.scala is faster than ND4J on large arrays.
  • ND4J is faster than Compute.scala when performing one simple primary operation on small arrays.
  • ND4J's permute and broadcast are extremely slow, causing very low scores in the convolution benchmark.

Note that the above ND4J results do not reflect the performance of Deeplearning4j, because Deeplearning4j uses ND4J in a mutable style (i.e. a *= b; a += c instead of a * b + c), and ND4J has some undocumented optimizations for permute and broadcast when they are invoked with certain special parameters from Deeplearning4j.

Future work

This project is currently only a minimum viable product. Many important features are still under development:

  • Support for tensors of element types other than single-precision floating-point (#104).
  • Add more OpenCL math functions (#101).
  • Further optimization of performance (#62, #103).
  • Other back-ends (CUDA, Vulkan Compute).

Contributions are welcome. Check the good first issues to start hacking.
