• This repository has been archived on 09/Apr/2021
Distributed decision tree ensemble learning in Scala

Brushfire

Brushfire is a framework for distributed supervised learning of decision tree ensemble models in Scala.

The basic approach to distributed tree learning is inspired by Google's PLANET, but considerably generalized thanks to Scala's type parameterization and Algebird's aggregation abstractions.

Brushfire currently supports:

  • binary and multi-class classifiers
  • numeric features (discrete and continuous)
  • categorical features (including those with very high cardinality)
  • k-fold cross validation and random forests
  • chi-squared test as a measure of split quality
  • feature importance and brier scores
  • Scalding/Hadoop as a distributed computing platform

In the future we plan to add support for:

  • regression trees
  • CHAID-like multi-way splits
  • error-based pruning
  • many more ways to evaluate splits and trees
  • Spark and single-node in-memory platforms

Authors

Thanks for assistance and contributions:

Quick start

```
sbt brushfireScalding/assembly
cd example
./iris
cat iris.output/step_03
```

If it worked, you should see a JSON representation of 4 versions of a decision tree for classifying irises.

To use brushfire in your own SBT project, add the following to your build.sbt:

```scala
libraryDependencies += "com.stripe" %% "brushfire" % "0.6.3"
```

To use brushfire as a jar in your own Maven project, add the following to your POM file:

```xml
<dependency>
  <groupId>com.stripe</groupId>
  <artifactId>brushfire_${scala.binary.version}</artifactId>
  <version>0.6.3</version>
</dependency>
```

Using Brushfire with Scalding

The only distributed computing platform that Brushfire currently supports is Scalding, version 0.12 or later.

The simplest way to use Brushfire with Scalding is by subclassing TrainerJob and overriding trainer to return an instance of Trainer. Example:

```scala
import com.stripe.brushfire._
import com.stripe.brushfire.scalding._
import com.twitter.scalding._

class MyJob(args: Args) extends TrainerJob(args) {
  import JsonInjections._

  def trainer = ???
}
```

You should import either `JsonInjections` or `KryoInjections` to specify serialization in JSON or base64-encoded Kryo, respectively. The former has the advantage of being human-readable; the latter is more efficient, which can be important for very large trees.

To construct a `Trainer`, you need to pass it training data as a Scalding `TypedPipe` of Brushfire [Instance[K, V, T]](http://stripe.github.io/brushfire/#com.stripe.brushfire.Instance) objects. `Instance` looks like this:

```scala
case class Instance[K, V, T](id: String, timestamp: Long, features: Map[K, V], target: T)
```

  • The id should be unique for each instance.
  • If there's an associated observation time, it should be the timestamp. (Otherwise, 0L is fine.)
  • features is a Map from feature name (type K, usually String) to some value of type V. There's built-in implicit support for Int, Double, Boolean, and String types (with the assumption for Int and String that there is a small, finite number of possible values). If, as is common, you need to mix different feature types, see the section on Dispatched below.
  • The only built-in support for target currently is for Map[L, Long], where L represents some label type (for example Boolean for a binary classifier or String for multi-class). The Long values represent the weight for the instance, which is usually 1.

Example:

```scala
Instance("AS-2014", 1416168857L, Map("lat" -> 49.2, "long" -> 37.1, "altitude" -> 35000.0), Map(true -> 1L))
```
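As a minimal sketch of how raw records might be mapped onto this shape, the snippet below converts a record into an instance with `Double` features and a `Map[Boolean, Long]` target. The `Instance` definition here is a local stand-in that mirrors the case class shown above so the snippet compiles on its own; the `Flight` record, its field names, and `toInstance` are hypothetical and not part of Brushfire.

```scala
object InstanceSketch {
  // Local stand-in mirroring the Instance case class shown above,
  // so this sketch compiles without Brushfire on the classpath.
  case class Instance[K, V, T](id: String, timestamp: Long, features: Map[K, V], target: T)

  // Hypothetical raw record; the field names are illustrative only.
  case class Flight(code: String, observedAt: Long, lat: Double, lon: Double,
                    altitude: Double, delayed: Boolean)

  // Build an Instance with Double-valued features and a Map[Boolean, Long]
  // target: the observed label maps to a weight of 1.
  def toInstance(f: Flight): Instance[String, Double, Map[Boolean, Long]] =
    Instance(
      id = f.code,
      timestamp = f.observedAt,
      features = Map("lat" -> f.lat, "long" -> f.lon, "altitude" -> f.altitude),
      target = Map(f.delayed -> 1L)
    )

  val instance: Instance[String, Double, Map[Boolean, Long]] =
    toInstance(Flight("AS-2014", 1416168857L, 49.2, 37.1, 35000.0, delayed = true))
}
```

Keeping the conversion in one small function makes it easy to apply inside a `map` over whatever pipe holds the raw records.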

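You also need to pass it a `Sampler`, which controls how instances are allocated to trees and between the training set and the test set. The fuller example below uses `KFoldSampler(4)`.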

Once you have constructed a `Trainer`, you most likely want to call `expandTimes(base: String, times: Int)`. This builds a new ensemble of trees from the training data and expands them `times` times, that is, to depth `times`. At each step, the trees are serialized to a directory (on HDFS, unless you're running in local mode) under `base`.

Fuller example:

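```scala
import com.stripe.brushfire._
import com.stripe.brushfire.scalding._
import com.twitter.scalding._

class MyJob(args: Args) extends TrainerJob(args) {
  import JsonInjections._

  // Concrete types chosen for illustration: Double-valued features keyed by
  // name, and a binary (Boolean) label as the target.
  def trainingData: TypedPipe[Instance[String, Double, Map[Boolean, Long]]] = ???

  // Expand the trees 5 times, serializing each step under the "output" argument.
  def trainer = Trainer(trainingData, KFoldSampler(4)).expandTimes(args("output"), 5)
}
```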

# In Memory Expansion

Having expanded as deep as you want using the distributed algorithm, you may wish to ask for further, in-memory expansion of any nodes that are sufficiently small at this point by calling `expandSmallNodes(path: String, times: Int)`. By default, this will downsample every node to at most 10,000 instances of training data, and expand until they have fewer than 10 instances. You may need to tune these values, which you do by setting an implicit `Stopper`:

```scala
implicit val stopper = FrequencyStopper(10000, 10)
trainer.expandInMemory(args("output") + "/mem", 100)
```

Note that the distributed algorithm will *stop* expanding at the same instance count that the in-memory algorithm wants, i.e., 10,000 instances by default.
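One way to read the two thresholds is as a routing rule on node size. The sketch below is only an illustration of that reading, not Brushfire's implementation: `phaseFor` and the `Phase` type are hypothetical, the default values are the 10,000 and 10 quoted above, and the exact treatment of the boundary values is an assumption.

```scala
object ExpansionSketch {
  // Hypothetical labels for where a node would be handled next.
  sealed trait Phase
  case object Distributed extends Phase // too large for memory: keep using the distributed algorithm
  case object InMemory extends Phase    // small enough to expand in memory
  case object Stop extends Phase        // too few instances to split further

  // Assumed reading of the defaults in the text: nodes above maxInMemory stay
  // with the distributed algorithm, nodes below minToSplit are not expanded,
  // and everything in between is expanded in memory. Boundary handling
  // (> versus >=) is a guess made for this sketch.
  def phaseFor(instanceCount: Long,
               maxInMemory: Long = 10000L,
               minToSplit: Long = 10L): Phase =
    if (instanceCount > maxInMemory) Distributed
    else if (instanceCount >= minToSplit) InMemory
    else Stop
}
```

Under this reading, raising the first threshold hands more of the tree to the in-memory step, while raising the second yields shallower trees.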

# Dispatched

If you have mixed features, the recommended value type is `Dispatched[Int,String,Double,String]`, which requires your feature values to match any one of these four cases:

* `Ordinal(v: Int)` for numeric features with a reasonably small number of possible values
* `Nominal(v: String)` for categorical features with a reasonably small number of possible values
* `Continuous(v: Double)` for numeric features with a large or infinite number of possible values
* `Sparse(v: String)` for categorical features with a large or infinite number of possible values

Note that using `Sparse` and especially `Continuous` features will currently slow learning down considerably. (But on the other hand, if you try to use `Ordinal` or `Nominal` with a feature that has hundreds of thousands of unique values, it will be even slower, and then fail).

Example of a features map:

```scala
Map("age" -> Ordinal(35), "gender" -> Nominal("male"), "weight" -> Continuous(130.23), "name" -> Sparse("John"))
```
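Choosing between the four cases comes down to the value type and how many distinct values a feature takes. The following is a rough sketch of that decision, assuming a cardinality cutoff of 100; the cutoff, the helper names, and the local `Feature` hierarchy are all hypothetical stand-ins so the snippet runs without Brushfire, and the real `Dispatched` cases may be defined differently.

```scala
object DispatchSketch {
  // Local stand-ins for the four cases named above.
  sealed trait Feature
  case class Ordinal(v: Int) extends Feature
  case class Nominal(v: String) extends Feature
  case class Continuous(v: Double) extends Feature
  case class Sparse(v: String) extends Feature

  // Hypothetical cutoff for "a reasonably small number of possible values".
  val MaxSmallCardinality = 100

  // Categorical column: Nominal when few distinct values, Sparse otherwise.
  def categorical(column: Seq[String]): Seq[Feature] = {
    val small = column.distinct.size <= MaxSmallCardinality
    column.map(v => if (small) Nominal(v) else Sparse(v))
  }

  // Numeric column: Ordinal when values are whole numbers that fit in an Int
  // and there are few distinct values, Continuous otherwise.
  def numeric(column: Seq[Double]): Seq[Feature] = {
    val distinct = column.distinct
    val small = distinct.size <= MaxSmallCardinality &&
      distinct.forall(d => d == d.toInt.toDouble)
    column.map(v => if (small) Ordinal(v.toInt) else Continuous(v))
  }
}
```

The decision is made once per column rather than per value, so every value of a given feature ends up in the same case.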

Extending Brushfire

Brushfire is designed to be extremely pluggable. Some ways you might want to extend it are (from simplest to most involved):

  • Add a new sampling strategy, to get finer-grained control over how instances are allocated to trees, or between the training set and the test set: define a new Sampler.
  • Add a new evaluation strategy (such as log-likelihood or entropy): define a new Evaluator.
  • Add a new feature type, or a new way of binning an existing feature type (such as log-binning real numbers): define a new Splitter.
  • Add a new target type (such as real-valued targets for regression trees): define a new Evaluator and a new Stopper, and quite likely a new Splitter for any continuous or sparse feature types you want to be able to use.
  • Add a new distributed computation platform: define a new equivalent of Trainer, idiomatic to the platform you're using. (There's no specific interface this should implement.)
