  • Stars: 104
  • Rank 320,297 (Top 7 %)
  • Language: Scala
  • Created over 9 years ago
  • Updated over 7 years ago


Repository Details

A Distributed Matrix Operations Library Built on Top of Spark

Marlin

A distributed matrix operations library built on top of Spark. The master branch is currently at version 0.4-SNAPSHOT.

## Branches Notice

This branch (spark-marlin) is built on a custom version of Spark to get better performance for matrix operations; however, this branch has not been published. If you use the official version of Spark, please refer to the master branch or the spark-1.0.x branch.

## Prerequisites

Because Marlin is built on top of Spark, you need to install Spark first. If you are not sure how to set up Spark, please refer to the guidelines here. Marlin is currently developed against the Spark 1.4.0 APIs.

## Compile Marlin

We currently use Maven to build the project. Run `mvn package -DskipTests` to produce the jar package. You can also select a build profile, e.g. spark-1.3, spark-1.2, or hadoop-2.4, to build Marlin for your environment.

Because of API changes in Breeze, we created a separate branch named spark-1.0.x, which is compatible with Spark 1.0.x, while the master branch focuses on later versions of Spark.

## Run Marlin

We provide some examples in edu.nju.pasalab.marlin.examples that show how to use the APIs in the project. For example, to multiply two large matrices, use spark-submit with the following command:

$ ./bin/spark-submit \
 --class edu.nju.pasalab.marlin.examples.MatrixMultiply \
 --master <master-url> \
 --executor-memory <memory> \
 marlin_2.10-0.2-SNAPSHOT.jar \
 <matrix A rows> <matrix A columns> \
 <matrix B columns> <cores cross the cluster>

Note: The pre-built Spark assembly jar does not include the netlib-java native components, so you cannot use a native linear algebra library such as BLAS to accelerate the computation; instead, every worker performs the small block matrix multiplications in pure Java. In our experiments we found a significant performance difference between native BLAS and the pure Java implementation. You can find more information about the performance comparison and how to load the native library here.

Note: This example uses MTUtils.randomDenVecMatrix to generate a distributed random matrix in memory, without reading data from files.

Note: <cores cross the cluster> is the number of cores across the cluster that you want to use.
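To make the first note above more concrete, here is a minimal sketch of what a small block matrix multiplication looks like when it runs in plain JVM code rather than through a native BLAS routine. It is only an illustration: the object `BlockMultiplySketch`, the method `multiply`, and the `blockSize` parameter are hypothetical names chosen for this example and are not part of Marlin's API, and the code does not claim to reproduce how Marlin or Breeze implement the operation.

```scala
// Illustrative only: a plain-JVM tiled multiply, not Marlin's actual code.
// All names here (BlockMultiplySketch, multiply, blockSize) are hypothetical.
object BlockMultiplySketch {
  type Matrix = Array[Array[Double]]

  /** Multiply an (n x k) matrix by a (k x m) matrix, one tile at a time. */
  def multiply(a: Matrix, b: Matrix, blockSize: Int): Matrix = {
    require(blockSize > 0, "blockSize must be positive")
    val n = a.length
    val k = if (n == 0) 0 else a(0).length
    require(b.length == k, "inner dimensions must match")
    val m = if (k == 0) 0 else b(0).length
    val c = Array.ofDim[Double](n, m)

    // Walk over tiles of the row, inner, and column index ranges.
    for {
      i0 <- 0 until n by blockSize
      p0 <- 0 until k by blockSize
      j0 <- 0 until m by blockSize
    } {
      val iEnd = math.min(i0 + blockSize, n)
      val pEnd = math.min(p0 + blockSize, k)
      val jEnd = math.min(j0 + blockSize, m)

      // Accumulate the contribution of tile A(i0.., p0..) * B(p0.., j0..) into C.
      var i = i0
      while (i < iEnd) {
        var p = p0
        while (p < pEnd) {
          val aip = a(i)(p)
          var j = j0
          while (j < jEnd) {
            c(i)(j) += aip * b(p)(j)
            j += 1
          }
          p += 1
        }
        i += 1
      }
    }
    c
  }
}
```

For instance, calling `BlockMultiplySketch.multiply(a, b, 1)` with `a = [[1, 2], [3, 4]]` and `b = [[5, 6], [7, 8]]` yields `[[19, 22], [43, 50]]`. A native BLAS library would typically handle each block with a single optimized call, which is where the performance gap mentioned in the note comes from.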

## Matrix Operations API in Marlin

We have finished some of the APIs so far; you can find the documentation on this page.

## Algorithms and Performance Evaluation

The details of the matrix multiplication algorithm are here.

### Performance Evaluation

We have done some performance evaluation of Marlin. The results can be seen here.

## Contact

gurongwalker at gmail dot com

myasuka at live dot com

More Repositories

1

MR-Course-Assignments

Assignments for courses of MapReduce
Shell
31
star
2

dolphin

Dolphin - a Deep Learning on MIC architecture Project.
C++
24
star
3

SmartFD

SmartFD: Efficient and Scalable Functional Dependency Discovery on Distributed Data-Parallel Platforms
Scala
17
star
4

tachyon-perf

A General Performance Test Framework for Tachyon
Java
16
star
5

dfs-perf

A general performance test framework for Distributed File System
Java
14
star
6

NAS-CTR

Python
12
star
7

DGST

DGST: Efficient and Scalable Generalized Suffix Tree Construction on Apache Spark
Scala
11
star
8

Liquid

Intelligent Resource Requirement Estimation and Scheduling for Deep Learning Jobs on Distributed GPU Clusters
Python
10
star
9

DIFER

Python
10
star
10

forestlayer

ForestLayer: Efficient and scalable deep forest learning library based on Ray
Python
10
star
11

AdaMCL

PyTorch implementation of AdaCML
Python
10
star
12

BigSpa

A framework for large-scale static program analysis.
Java
8
star
13

seal

Training Large Scale Statistical Machine Translation Models on Spark
Scala
4
star
14

Octopus-DF

A cross-platform pandas-like DataFrame based on Pandas, Spark and Dask.
Python
4
star
15

HAGNN

Hybrid Aggregation for Heterogeneous Graph Neural Networks
Python
4
star
16

PSP

Python
4
star
17

cichlid

Cichlid is a distributed RDFS & OWL reasoning system based on Spark.
Scala
4
star
18

EAAFE

The codes for paper "Evolutionary Automated Feature Engineering"
Python
3
star
19

Coral

Coral: Federated Query Join Order Optimization Based on Deep Reinforcement Learning
Java
2
star
20

Magpie

Efficient Big Data Query System Parameter Optimization based on Pre-selection and Search Pruning Approach
Java
2
star
21

trasa

Implementation for Transition Relation Aware Self-Attention for Session-based Recommendation
Python
2
star
22

AutoAC

Python
2
star
23

SparkDQ

The code repository for SparkDQ, a big data quality management system.
Python
2
star
24

PMPAS

Python
1
star
25

TSSE

Topic model for short text
Python
1
star
26

PGA

partial attack for graph global attack
Jupyter Notebook
1
star
27

FSClientCache

code repo for file system client-side cache paper
Java
1
star
28

Raven

Benchmarking query engines with pre-computation on cloud
Python
1
star
29

UniGPS

Unified Graph Programming Framework
1
star
30

GADAM

1
star