• Stars: 131
• Rank: 275,867 (Top 6 %)
• Language: Scala
• License: GNU General Publi...
• Created: about 8 years ago
• Updated: over 4 years ago

Repository Details

Scala for Statistical Computing and Data Science Short Course

I occasionally run this course in-house for companies; email me if your company is interested. I also run an advanced course on category theory for pure FP in Scala.

Registered course participants should bookmark the Start Here page. Please carefully follow the laptop setup instructions in advance of the start of the course.

Outline course description

This course is aimed at statisticians and data scientists already familiar with a dynamic programming language (such as R, Python or Octave) who would like to learn how to use Scala. Scala is a free, modern, powerful, strongly-typed, functional programming language, well-suited to statistical computing and data science applications. In particular, it is fast and efficient, runs on the Java virtual machine (JVM), and is designed to easily exploit modern multi-core and distributed computing architectures.

The course will begin with an introduction to the Scala language and basic concepts of functional programming (FP), as well as essential Scala tools such as SBT for managing builds and library dependencies. The course will continue with an overview of the Scala collections library, including parallel collections, and we will see how parallel collections enable trivial parallelisation of many statistical computing algorithms on multi-core hardware. We will next survey the wider Scala library ecosystem, paying particular attention to Breeze, the Scala library for scientific computing and numerical linear algebra. We will see how to exploit non-uniform random number generation and matrix computations in Breeze for statistical applications. Both maximum-likelihood and simulation-based Bayesian statistical inference algorithms will be considered. Much of the final day will be dedicated to understanding Apache Spark, the distributed Big Data analytics platform for Scala. We will understand how Spark relates to the parallel collections we have already examined, and see how it can be used not only for the processing of very large data sets, but also for the parallel and distributed analysis of large or otherwise computationally-intensive models. As time permits, we will discuss more advanced FP concepts, such as typeclasses, higher-kinded types, monoids, functors, monads, applicatives, streams and streaming data, and see how these enable the development of flexible, scalable, generic code in strongly-typed functional languages.
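As a rough illustration of why monoids come up alongside parallel and distributed computation, here is a minimal sketch using only the Scala standard library. It is not taken from the course materials: the names `Monoid`, `Moments`, `fromValue` and `combineAll` are hypothetical, chosen for this example. The idea is that when per-observation summaries combine associatively, a data set can be summarised chunk by chunk and the partial results merged afterwards, which is the pattern that parallel collections and Spark rely on.

```scala
// Sketch only: Monoid, Moments, fromValue and combineAll are hypothetical names
// for illustration; they are not taken from the course materials or any library.
object MonoidSketch {

  // A monoid: an associative combine operation with an identity element.
  trait Monoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  // Sufficient statistics for a sample mean and variance.
  final case class Moments(n: Long, sum: Double, sumSq: Double) {
    def mean: Double = sum / n
    // Unbiased sample variance; only meaningful when n > 1.
    def variance: Double = (sumSq - sum * sum / n) / (n - 1)
  }

  // Summary of a single observation.
  def fromValue(x: Double): Moments = Moments(1L, x, x * x)

  // Moments form a monoid under component-wise addition.
  implicit val momentsMonoid: Monoid[Moments] = new Monoid[Moments] {
    val empty: Moments = Moments(0L, 0.0, 0.0)
    def combine(x: Moments, y: Moments): Moments =
      Moments(x.n + y.n, x.sum + y.sum, x.sumSq + y.sumSq)
  }

  // Generic fold that works for any type with a Monoid instance.
  def combineAll[A](xs: Seq[A])(implicit m: Monoid[A]): A =
    xs.foldLeft(m.empty)(m.combine)

  def main(args: Array[String]): Unit = {
    val data = Vector(1.0, 2.0, 3.0, 4.0)

    // Summarise the whole data set in one pass.
    val whole = combineAll(data.map(fromValue))

    // Summarise chunks independently, then merge the partial summaries.
    val chunked = combineAll(
      data.grouped(2).map(chunk => combineAll(chunk.map(fromValue))).toVector
    )

    println(s"mean = ${whole.mean}, variance = ${whole.variance}")
    println(s"chunked summary agrees: ${whole == chunked}")
  }
}
```

In this small example the chunked and single-pass summaries agree exactly. With general floating-point data, the two may differ by rounding error, because floating-point addition is only approximately associative.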

Prerequisites

The course assumes a basic familiarity with essential concepts in statistical computing, as well as some basic programming experience. It is assumed that participants will be familiar with writing their own functions in a language such as R, including essential control structures such as "for-loops" and "if-statements". The course is not suitable for people completely new to programming. However, no prior knowledge of Scala or functional programming is assumed. All participants will be expected to bring their own (multi-core) laptop and to have a recent version of Java pre-installed. Other set-up instructions will be provided in advance to registered participants.

Course structure

The course will be delivered through a combination of lectures, live demos and hands-on practical sessions. For the practical sessions, participants will be expected to actively engage with the material, run demos, follow examples, and write code to solve simple problems.

Presenters

The course will be delivered by Prof Darren Wilkinson (Newcastle University, U.K.). Prof Wilkinson is co-Director of Newcastle's EPSRC Centre for Doctoral Training in Cloud Computing for Big Data, and a Turing Fellow. He is a well-known expert in computational Bayesian statistics and a leading proponent of the use of strongly-typed FP languages (such as Scala) for scalable statistical computing.

More Repositories

1. fp-ssc-course (Scala, 70 stars)
   An introduction to functional programming for scalable statistical computing

2. logreg (Python, 43 stars)
   Bayesian inference for a logistic regression model in various languages

3. blog (Scala, 35 stars)
   Code samples associated with my blog posts

4. smfsb (AMPL, 32 stars)
   Documentation, models and code relating to the 3rd edition of the textbook Stochastic Modelling for Systems Biology

5. scala-glm (Scala, 28 stars)
   Scala library for fitting linear and generalised linear statistical models

6. djwhacks (PostScript, 13 stars)
   Test repo for sharing code snippets and learning about git

7. scala-smfsb (Scala, 12 stars)
   Scala library for biochemical network simulation, associated with the 3rd edition of the textbook Stochastic Modelling for Systems Biology

8. statslang-scala (TeX, 9 stars)
   LaTeX slides and scala code for the 2014 RSS Statistical Computing Section Meeting on languages for statistical computing

9. fps-course (Scala, 8 stars)
   Category theory for pure functional programming in Scala - materials for course participants

10. scala-view (Scala, 7 stars)
    Small Scala library for viewing a Stream of ScalaFX Images (or Swing/AWT BufferedImages) in a window on-screen

11. isba2021 (Makefile, 6 stars)
    Material relating to my ISBA 2021 presentation

12. metagenomics (R, 5 stars)
    Materials for my taxa abundance/metagenomics hands-on training session

13. unbiased-mcmc (TeX, 5 stars)
    Unbiased MCMC with couplings

14. stats-dce-workshop (HTML, 5 stars)
    Materials for a workshop on statistical methods using R, for geotechnical engineers

15. dsmts (TeX, 4 stars)
    Discrete stochastic models test suite

16. sbml-sh (Python, 3 stars)
    SBML-shorthand - a concise notation for Systems Biology models targeting SBML

17. code-examples (C, 3 stars)
    Some simple coding examples

18. BWK (C, 2 stars)
    Code from Boys, Wilkinson and Kirkwood (2008)

19. python-smfsb (Python, 2 stars)
    Python library for the book, Stochastic modelling for systems biology, third edition

20. FPNEM-2020-05 (Scala, 1 star)
    Materials for the (online) May 2020 FP North East Meetup

21. FPNEM-2016-04 (Scala, 1 star)
    Repo for FP North East meetup, April 2016

22. sv (C, 1 star)
    C Library for Bayesian MCMC fitting of (factor) stochastic volatility models

23. FPNEM-2017-03 (Scala, 1 star)
    FP North East meetup: Hands-on introduction to Apache Spark

24. breeze.g8 (Scala, 1 star)
    SBT Template for Scala + Breeze

25. talks (1 star)
    Supporting information for talks

26. monte-scala (Scala, 1 star)
    Tools for the development of Monte Carlo algorithms in Scala

27. scala-course-exsol (Scala, 1 star)
    Sketch solutions for my Scala course end-of-chapter exercises