  • Stars: 681
  • Rank: 66,346 (Top 2%)
  • Language: Scala
  • License: Other
  • Created: about 8 years ago
  • Updated: over 6 years ago


Repository Details

A lightweight, super fast, large-scale machine learning library on Spark.

Fregata: Machine Learning


  • Fregata is a lightweight, super fast, large-scale machine learning library based on Apache Spark, and it provides high-level APIs in Scala.

  • More accurate: For various problems, Fregata can achieve higher accuracy compared to MLLib.

  • Higher speed: For Generalized Linear Models, Fregata often converges in a single data epoch. On a 1 billion X 1 billion data set, Fregata can train a Generalized Linear Model in 1 minute with memory caching, or in 10 minutes without it. Usually, Fregata is 10-100 times faster than MLLib.

  • Parameter free: Fregata uses GSA SGD optimization, which doesn't require learning rate tuning, because we found a way to calculate an appropriate learning rate during training. When confronted with super high-dimensional problems, Fregata dynamically calculates the remaining memory to determine the sparseness of the output, balancing accuracy and efficiency automatically. Both features allow Fregata to be treated as a standard module in data processing pipelines for different problems.

  • Lighter weight: Fregata uses only Spark's standard API, which allows it to be integrated into most businesses' data processing flows on Spark quickly and seamlessly.

Architecture

This documentation is about Fregata version 0.1

  • core : mainly implements stand-alone algorithms based on GSA, including Classification, Regression and Clustering
    • Classification: supports both binary and multiple classification
    • Regression: will release later
    • Clustering: will release later
  • spark : mainly implements large-scale machine learning algorithms on Spark by wrapping core.jar, and supplies the corresponding algorithms

Fregata supports Spark 1.x and 2.x with Scala 2.10 and Scala 2.11.
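
For reference, a cross-built sbt project targeting both Scala versions might look like the minimal sketch below; the Spark version and the "provided" scope are illustrative assumptions, not requirements stated by Fregata.

    // build.sbt -- minimal cross-build sketch; Spark version is an assumption
    scalaVersion := "2.11.8"
    crossScalaVersions := Seq("2.10.6", "2.11.8")

    libraryDependencies ++= Seq(
      "org.apache.spark"        %% "spark-core" % "2.1.0" % "provided", // pick the version matching your cluster
      "com.talkingdata.fregata"  % "core"       % "0.0.3",
      "com.talkingdata.fregata"  % "spark"      % "0.0.3"
    )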

Algorithms

Installation

There are two ways to get Fregata, via Maven or SBT:

  • Maven's pom.xml
    <dependency>
        <groupId>com.talkingdata.fregata</groupId>
        <artifactId>core</artifactId>
        <version>0.0.3</version>
    </dependency>
    <dependency>
        <groupId>com.talkingdata.fregata</groupId>
        <artifactId>spark</artifactId>
        <version>0.0.3</version>
    </dependency>
  • SBT's build.sbt
    // if you deploy to local mvn repository please add
    // resolvers += Resolver.mavenLocal
    libraryDependencies += "com.talkingdata.fregata" % "core" % "0.0.3"
    libraryDependencies += "com.talkingdata.fregata" % "spark" % "0.0.3"

If you want to manually deploy Fregata to your local Maven repository, do the following:

git clone https://github.com/TalkingData/Fregata.git
cd Fregata
mvn clean package install

Quick Start

Assuming that you're familiar with Spark, the example below shows how to use Fregata's Logistic Regression; the experimental data can be obtained from LIBSVM Data. A complete end-to-end sketch follows the step-by-step list below.

  • adding Fregata to your project via Maven or SBT, as described in the Installation part
  • importing the packages
	import fregata.spark.data.LibSvmReader
	import fregata.spark.metrics.classification.{AreaUnderRoc, Accuracy}
	import fregata.spark.model.classification.LogisticRegression
	import org.apache.spark.{SparkConf, SparkContext}
  • loading the training data with Fregata's LibSvmReader API
    val (_, trainData)  = LibSvmReader.read(sc, trainPath, numFeatures.toInt)
    val (_, testData)  = LibSvmReader.read(sc, testPath, numFeatures.toInt)
  • building a Logistic Regression model from the training data
    val model = LogisticRegression.run(trainData)
  • predicting the scores of instances
    val pd = model.classPredict(testData)
  • evaluating the quality of the model's predictions by AUC or other metrics
    val auc = AreaUnderRoc.of( pd.map{
      case ((x,l),(p,c)) =>
        p -> l
    })
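
Putting the steps above together, a complete driver program might look like the sketch below. The SparkConf setup and the command-line arguments are placeholders added for the example, and the Accuracy usage is assumed to mirror AreaUnderRoc (predicted class against label); only the LibSvmReader, LogisticRegression and AreaUnderRoc calls are taken directly from the steps above.

    import fregata.spark.data.LibSvmReader
    import fregata.spark.metrics.classification.{AreaUnderRoc, Accuracy}
    import fregata.spark.model.classification.LogisticRegression
    import org.apache.spark.{SparkConf, SparkContext}

    object LogisticRegressionExample {
      def main(args: Array[String]): Unit = {
        // trainPath / testPath point to LIBSVM-formatted files, numFeatures is the feature count
        val Array(trainPath, testPath, numFeatures) = args
        val sc = new SparkContext(new SparkConf().setAppName("fregata-lr-example"))

        // load the data with Fregata's LibSvmReader API
        val (_, trainData) = LibSvmReader.read(sc, trainPath, numFeatures.toInt)
        val (_, testData)  = LibSvmReader.read(sc, testPath, numFeatures.toInt)

        // train the model -- no learning rate to tune
        val model = LogisticRegression.run(trainData)

        // each prediction record is ((features, label), (score, predictedClass))
        val pd = model.classPredict(testData)

        // evaluate with AUC (score vs. label) and accuracy (predicted class vs. label)
        val auc = AreaUnderRoc.of(pd.map { case ((x, l), (p, c)) => p -> l })
        val acc = Accuracy.of(pd.map { case ((x, l), (p, c)) => c -> l })
        println(s"auc = $auc , accuracy = $acc")

        sc.stop()
      }
    }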

Input Data Format

Fregata's training API needs an RDD[(fregata.Vector, fregata.Num)]; the predicting API accepts the same type, or an RDD[fregata.Vector] without labels.

	import breeze.linalg.{Vector => BVector , SparseVector => BSparseVector , DenseVector => BDenseVector}
	import fregata.vector.{SparseVector => VSparseVector }

	package object fregata {
	  type Num = Double
	  type Vector = BVector[Num]
	  type SparseVector = BSparseVector[Num]
	  type SparseVector2 = VSparseVector[Num]
	  type DenseVector = BDenseVector[Num]
	  def zeros(n:Int) = BDenseVector.zeros[Num](n)
	  def norm(x:Vector) = breeze.linalg.norm(x,2.0)
	  def asNum(v:Double) : Num = v
	}
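
For instance, with these aliases in scope a single labeled instance can be built as in the small sketch below (the feature values and the label are made up for illustration):

	import fregata._
	// a 4-feature dense instance labeled 1.0; values are illustrative only
	val x : Vector = breeze.linalg.DenseVector(1.0, 0.0, 2.5, 3.0)
	val labeled : (Vector, Num) = (x, asNum(1.0))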
  • if the data format is LibSvm, then Fregata's LibSvmReader.read() API can be used directly
	// sc is the SparkContext
	// path is the location of the input data on HDFS
	// numFeatures is the number of features of a single instance
	// minPartition is the minimum number of partitions of the returned RDD over the input data
	read(sc:SparkContext, path:String, numFeatures:Int=-1, minPartition:Int=-1):(Int, RDD[(fregata.Vector, fregata.Num)])
  • otherwise, the RDD needs to be constructed manually, for example:

    • Using SparseVector
     	// indices is a 0-based Array of the positions whose feature values are non-zero
     	// values  is an Array storing the corresponding values of indices
     	// length  is the total number of features of each instance
     	// label   is the instance's label
    
     	// input data with labels
     	sc.textFile(input).map{ line =>
     		val indices = ...
     		val values  = ...
     		val label   = ...
     		...
     		(new SparseVector(indices, values, length).asInstanceOf[Vector], asNum(label))
     	}
    
     	// input data without labels (for the predicting API only)
     	sc.textFile(input).map{ line =>
     		val indices = ...
     		val values  = ...
     		...
     		new SparseVector(indices, values, length).asInstanceOf[Vector]
     	}
    • Using DenseVector
     	// datas is an Array holding the value of each feature
     	// label is the instance's label
    
     	// input data with labels
     	sc.textFile(input).map{ line =>
     		val datas = ...
     		val label = ...
     		...
     		(new DenseVector(datas).asInstanceOf[Vector], asNum(label))
     	}
    
     	// input data without labels (for the predicting API only)
     	sc.textFile(input).map{ line =>
     		val datas = ...
     		...
     		new DenseVector(datas).asInstanceOf[Vector]
     	}

MailList:

Contributors:

Contributed by TalkingData.

More Repositories

  1. iview-weapp: A high-quality WeChat Mini Program UI component library (Less, 6,587 stars)
  2. inmap: Big data geographic visualization (JavaScript, 2,772 stars)
  3. owl: Distributed monitoring system (Go, 838 stars)
  4. YourView: A macOS desktop app based on Apple SceneKit for viewing an iOS app's view hierarchy in 3D (Objective-C, 628 stars)
  5. Myna: A context-awareness framework for the Android platform (Java, 157 stars)
  6. owl-frontend (Vue, 63 stars)
  7. pecker-c: 🐦 Front-end application error monitoring and analysis platform (TypeScript, 52 stars)
  8. tap2debug: An iOS SpringBoard tweak; double-click to start the debug server (Logos, 28 stars)
  9. AppAnalytics_SDK_ReactNative: TalkingData React Native SDK wrapper layer (Objective-C, 19 stars)
  10. fsd (CSS, 14 stars)
  11. rxloop: rxloop = Redux + redux-observable (inspired by dva) (JavaScript, 10 stars)
  12. Shrike: Flat Layer-2 Docker networking tool and Swarm cluster management tool (Go, 8 stars)
  13. eago: Distributed internal O&M and IT platform, refactored in Golang with a microservice architecture (Go, 7 stars)
  14. AppAnalytics_SDK_Unity (C#, 7 stars)
  15. AppAnalytics_SDK_Hybrid (Objective-C, 6 stars)
  16. magpie: A command-line tool for deploying and managing Yarn on a Docker cluster (Go, 6 stars)
  17. AppAnalytics_SDK_Plugin: A sample-code generation plugin based on the TalkingData AppAnalytics SDK, focused on developer productivity, making TalkingData SDK integration simple and efficient (Java, 5 stars)
  18. analytics-openapi-example: TalkingData Analytics OpenAPI invocation example (Java, 5 stars)
  19. AppAnalytics_SDK_Cordova (Objective-C, 4 stars)
  20. TalkingDataSDK_Flutter (Dart, 3 stars)
  21. FragmentDemo (Java, 3 stars)
  22. t-design: Inspired by Ant Design Pro (CSS, 2 stars)
  23. flclover: Build better enterprise frameworks and apps with Node.js & Koa2 (JavaScript, 2 stars)
  24. AdTracking_SDK_Cordova: TalkingData ADT Cordova plugin (Objective-C, 1 star)
  25. SDKMaven: TalkingData SDK Maven Repository (1 star)
  26. MultiLayerStacking: sklearn-API-friendly multi-layer stacking Python module (Python, 1 star)
  27. rxloop-loading: rxloop loading plugin (JavaScript, 1 star)
  28. AdTracking_SDK_Unity (C#, 1 star)
  29. TalkingDataSDK_Unity (C#, 1 star)
  30. todo-app-with-rxloop: A todo app built with rxloop (JavaScript, 1 star)