• Stars: 105
• Rank: 328,196 (Top 7%)
• Language: Scala
• Created: about 10 years ago
• Updated: almost 8 years ago

Repository Details

A Distributed Matrix Operations Library Built on Top of Spark

Marlin

A distributed matrix operations library built on top of Spark. The master branch is currently at version 0.4-SNAPSHOT.

## Branches Notice

This branch (spark-marlin) is built on a custom version of Spark to get better performance for matrix operations; however, this branch has not been published yet. If you use the official version of Spark, please refer to the master branch or the spark-1.0.x branch.

## Prerequisites

Marlin is built on top of Spark, so you need to install Spark first. If you are not sure how to set up Spark, please refer to the guidelines here. Marlin is currently developed against the APIs of Spark 1.4.0.

## Compile Marlin

We currently use Maven to build the project; run mvn package -DskipTests to produce the jar package. You can also select profiles, e.g. spark-1.3, spark-1.2, or hadoop-2.4, to build Marlin for your environment.
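As a rough sketch of how the profile selection might look in practice, the script below combines one Spark profile and one Hadoop profile in a single build command. The profile names are the ones listed above; whether a given pair can be combined depends on the project's pom.xml, which is an assumption here, and the variable names (SPARK_PROFILE, HADOOP_PROFILE, RUN_BUILD) are hypothetical. Running mvn help:all-profiles in the project directory lists the profiles that are actually defined.

```shell
#!/bin/sh
# Sketch only: build Marlin with explicit Maven profiles.
# Profile names come from the README; that this particular pair can be
# combined is an assumption, so check the <profiles> section of pom.xml.
set -eu

# Hypothetical variable names; override them from the environment if needed.
SPARK_PROFILE="${SPARK_PROFILE:-spark-1.3}"
HADOOP_PROFILE="${HADOOP_PROFILE:-hadoop-2.4}"

# Maven activates several profiles when -P receives a comma-separated list.
PROFILES="${SPARK_PROFILE},${HADOOP_PROFILE}"

# Print the command by default; set RUN_BUILD=1 to actually run the build.
echo "mvn package -DskipTests -P${PROFILES}"
if [ "${RUN_BUILD:-0}" = "1" ]; then
  mvn package -DskipTests "-P${PROFILES}"
fi
```

Printing the command before running it makes it easy to confirm which profiles are in effect before a long build starts.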

Because of API changes in Breeze, we created a separate branch named spark-1.0.x, which is compatible with Spark 1.0.x, while the master branch mainly focuses on later versions of Spark.

## Run Marlin

We provide some examples in edu.nju.pasalab.marlin.examples that show how to use the project's APIs. For example, to multiply two large matrices, use spark-submit with the following command:

$ ./bin/spark-submit \
 --class edu.nju.pasalab.marlin.examples.MatrixMultiply \
 --master <master-url> \
 --executor-memory <memory> \
 marlin_2.10-0.2-SNAPSHOT.jar \
 <matrix A rows> <matrix A columns> \
 <matrix B columns> <cores cross the cluster>

Note: The pre-built Spark assembly jar does not include the netlib-java native components, so you cannot use a native linear algebra library such as BLAS to accelerate the computation; each worker has to perform the small block matrix multiplications in pure Java. Our experiments show a significant performance difference between native BLAS and pure Java. Here you can find more information about the performance comparison and how to load the native library.

Note: This example uses MTUtils.randomDenVecMatrix to generate a distributed random matrix in memory, without reading data from files.

Note: <cores cross the cluster> is the number of cores across the cluster that you want to use.
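To make the placeholders concrete, here is a minimal sketch of a wrapper that fills them in. Every value in it (the master URL, the executor memory, the matrix dimensions, and the core count) is hypothetical and only for illustration, as are the variable names. The size estimate assumes dense matrices of 8-byte doubles, which the README does not state. Since the example takes only A's rows, A's columns, and B's columns, B necessarily has as many rows as A has columns.

```shell
#!/bin/sh
# Sketch only: fill in the MatrixMultiply placeholders with example values.
# All values below are hypothetical; replace them with your own settings.
set -eu

MASTER_URL="${MASTER_URL:-spark://master:7077}"    # hypothetical master URL
EXECUTOR_MEMORY="${EXECUTOR_MEMORY:-8g}"           # hypothetical memory size
MARLIN_JAR="${MARLIN_JAR:-marlin_2.10-0.2-SNAPSHOT.jar}"

A_ROWS=10000     # rows of matrix A
A_COLS=10000     # columns of A, which is also the number of rows of B
B_COLS=10000     # columns of matrix B
TOTAL_CORES=64   # cores to use across the cluster

# Rough size of A in bytes, assuming dense 8-byte doubles (an assumption).
A_BYTES=$((A_ROWS * A_COLS * 8))
echo "Matrix A needs roughly ${A_BYTES} bytes before any overhead"

SUBMIT_ARGS="--class edu.nju.pasalab.marlin.examples.MatrixMultiply --master ${MASTER_URL} --executor-memory ${EXECUTOR_MEMORY} ${MARLIN_JAR} ${A_ROWS} ${A_COLS} ${B_COLS} ${TOTAL_CORES}"

# Print the command by default; set RUN_SUBMIT=1 to actually submit the job.
echo "./bin/spark-submit ${SUBMIT_ARGS}"
if [ "${RUN_SUBMIT:-0}" = "1" ]; then
  # Left unquoted on purpose so the shell splits the string into arguments;
  # this breaks if any value contains spaces.
  ./bin/spark-submit ${SUBMIT_ARGS}
fi
```

The size estimate is only a starting point for choosing the executor memory; the product matrix and any intermediate blocks need space as well.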

## Matrix Operations API in Marlin

We have finished some of the APIs so far; you can find the documentation on this page.

## Algorithms and Performance Evaluation

The details of the matrix multiplication algorithm are here.

### Performance Evaluation

We have done some performance evaluation of Marlin. The results can be seen here.

## Contact

gurongwalker at gmail dot com

myasuka at live dot com

More Repositories

1. MR-Course-Assignments (Shell, 31 stars): Assignments for courses of MapReduce
2. dolphin (C++, 24 stars): Dolphin - a Deep Learning on MIC architecture Project.
3. SmartFD (Scala, 17 stars): SmartFD: Efficient and Scalable Functional Dependency Discovery on Distributed Data-Parallel Platforms
4. tachyon-perf (Java, 16 stars): A General Performance Test Framework for Tachyon
5. dfs-perf (Java, 13 stars): A general performance test framework for Distributed File System
6. DGST (Scala, 12 stars): DGST: Efficient and Scalable Generalized Suffix Tree Construction on Apache Spark
7. NAS-CTR (Python, 12 stars)
8. Liquid (Python, 10 stars): Intelligent Resource Requirement Estimation and Scheduling for Deep Learning Jobs on Distributed GPU Clusters
9. DIFER (Python, 10 stars)
10. forestlayer (Python, 10 stars): ForestLayer: Efficient and scalable deep forest learning library based on Ray
11. AdaMCL (Python, 10 stars): PyTorch implementation of AdaCML
12. BigSpa (Java, 9 stars): A framework for large-scale static program analysis.
13. HAGNN (Python, 7 stars): Hybrid Aggregation for Heterogeneous Graph Neural Networks
14. GADAM (5 stars)
15. seal (Scala, 4 stars): Training Large Scale Statistical Machine Translation Models on Spark
16. Octopus-DF (Python, 4 stars): A cross-platform pandas-like Dataframe based on Pandas, Spark and Dask.
17. PSP (Python, 4 stars)
18. cichlid (Scala, 4 stars): Cichlid is a distributed RDFS & OWL reasoning system based on Spark.
19. EAAFE (Python, 3 stars): The codes for paper "Evolutionary Automated Feature Engineering"
20. AutoAC (Python, 3 stars)
21. AutoMTL (Python, 3 stars): The official implementation of paper *Automatic Multi-Task Learning Framework with Neural Architecture Search in Recommendations*
22. Coral (Java, 2 stars): Coral: Federated Query Join Order Optimization Based on Deep Reinforcement Learning
23. Magpie (Java, 2 stars): Efficient Big Data Query System Parameter Optimization based on Pre-selection and Search Pruning Approach
24. trasa (Python, 2 stars): Implementation for Transition Relation Aware Self-Attention for Session-based Recommendation
25. PGA (Jupyter Notebook, 2 stars): partial attack for graph global attack
26. SparkDQ (Python, 2 stars): The code repository for SparkDQ, a big data quality management system.
27. CasMLN (Python, 2 stars)
28. LLM_Paper_Learning (2 stars): LLM Papers We Recommend to Read
29. PMPAS (Python, 1 star)
30. TSSE (Python, 1 star): Topic model for short text
31. Raven (Python, 1 star): Benchmarking query engines with pre-computation on cloud
32. FSClientCache (Java, 1 star): code repo for file system client-side cache paper
33. UniGPS (1 star): Unified Graph Programming Framework
34. SAGNAS (Python, 1 star)