
Spree


Spree is a live-updating web UI for Spark built with Meteor and React.

[Screencast: a Spark job running while the UI updates along with it]

Left: Spree pages showing all jobs and stages, updating in real-time; right: a spark-shell running a simple job; see the Screencast gallery in this repo for more examples.

Features!

Spree is a complete rewrite of Spark's web UI, providing several notable benefits…

Real-time Updating

All data on all pages updates in real-time, thanks to Meteor magic.
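
To illustrate the mechanism (a generic Meteor sketch, not Spree's actual code; the "stages" collection and publication names are hypothetical), live updates come from Meteor's publish/subscribe machinery over reactive Mongo cursors:

// common code (client + server): a collection backed by Mongo
Stages = new Mongo.Collection('stages');

if (Meteor.isServer) {
  // Publish a reactive cursor; Meteor pushes changes to all subscribers
  // whenever matching documents change in Mongo.
  Meteor.publish('stages', function (appId) {
    return Stages.find({ appId: appId });
  });
}

if (Meteor.isClient) {
  // Subscribing keeps the client-side cache in sync, so any component
  // rendering Stages.find() re-renders automatically.
  Meteor.subscribe('stages', 'app-20150101-0001');
}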

Persistence, Scalability

Spree offers a unified interface to past- and currently-running Spark applications, combining functionality that is currently spread across Spark's web UI and "history server".

It persists all information about Spark applications to MongoDB, allowing for archival storage that is easily query-able and solves various Spark-history-server issues, e.g. slow load-times, caching problems, etc.

Pagination and sorting are delegated to Mongo for graceful handling of arbitrarily large stages, RDDs, etc., which makes for a cleaner scalability story than Spark's current usage of textual event-log files and in-memory maps on the driver as ad-hoc databases.
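
For example, a page of task rows can be fetched and sorted entirely in Mongo (a minimal sketch using the Node mongodb driver's classic callback API; the collection and field names are hypothetical, not Spree's actual schema):

// Fetch one page of tasks, sorted by duration, straight from Mongo
// rather than from in-memory state on the Spark driver.
var MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost:3001/meteor', function (err, db) {
  if (err) throw err;
  db.collection('tasks')        // illustrative collection name
    .find({ stageId: 12 })      // tasks for one (hypothetical) stage
    .sort({ duration: -1 })     // sorting happens in Mongo
    .skip(100)                  // second page...
    .limit(100)                 // ...of 100 rows
    .toArray(function (err, page) {
      if (err) throw err;
      console.log(page.length + ' tasks on this page');
      db.close();
    });
});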

Usability

Spree also offers several usability improvements:

Toggle-able Columns

All tables allow easy customization of displayed columns:

Collapsible Tables

Additionally, whole tables can be collapsed/uncollapsed for easy access to content that would otherwise be "below the fold":

Persistent Preferences/State

Finally, all client-side state is stored in cookies for persistence across refreshes / sessions, including:

  • sort-column and direction,
  • table collapsed/uncollapsed status,
  • table columns' shown/hidden status,
  • pages' displaying one table with "all" records vs. separate tables for "running", "succeeded", "failed" records, etc.
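
For illustration, the general pattern for this kind of cookie-backed persistence looks like the following (a sketch only, with hypothetical cookie keys, not Spree's actual implementation):

// Persist a table's sort state across refreshes via document.cookie.
function saveSort(table, column, ascending) {
  var value = encodeURIComponent(JSON.stringify({ column: column, ascending: ascending }));
  document.cookie = table + '-sort=' + value + '; path=/; max-age=31536000';
}

function loadSort(table) {
  var match = document.cookie.match(new RegExp('(?:^|; )' + table + '-sort=([^;]*)'));
  return match ? JSON.parse(decodeURIComponent(match[1])) : null;
}

// e.g.: saveSort('stages', 'duration', false);  loadSort('stages');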

Extensibility, Modularity

Spree is easy to fork and customize without worrying about changing everyone's Spark UI experience, managing custom Spark builds with bespoke UI changes, etc.

It also includes two useful standalone modules for exporting/persisting data from Spark applications:

  • The json-relay module broadcasts all Spark events over a network socket.
  • The slim module aggregates stats about running Spark jobs and persists them to indexed Mongo collections.

These offer potentially-useful alternatives to Spark's EventLoggingListener and event-log files, respectively (Spark's extant tools for exporting and persisting historical data about past and current Spark applications).

Usage

Spree has three components, each in its own subdirectory:

  • ui: a web-app that displays the contents of a Mongo database populated with information about running Spark applications.
  • slim: a Node server that receives events about running Spark applications, aggregates statistics about them, and writes them to Mongo for Spree's ui above to read/display.
  • json-relay: a SparkListener that serializes SparkListenerEvents to JSON and sends them to a listening slim process.

The latter two are linked into this repo as git submodules, so you'll want to have cloned with git clone --recursive (or run git submodule update --init) in order for them to be present.

Following are instructions for configuring/running them:

Start Spree

First, run a Spree app using Meteor:

git clone --recursive https://github.com/hammerlab/spree.git
cd spree/ui   # the Spree Meteor app lives in ui/ in this repo.
meteor        # run it

You can now see your (presumably empty) Spree dashboard at http://localhost:3000:

If you don't have meteor installed, see "Installing Meteor" below.

Start slim

Next, install and run slim:

npm install -g slim.js
slim

If you have an older, unsupported version of npm installed, the above command may fail with errors containing the message "failed to fetch from registry". If so, upgrade node and npm and try again.

slim is a Node server that receives events from JsonRelay and writes them to the Mongo instance that Spree is watching.

By default, slim listens for events on localhost:8123 and writes to a Mongo at localhost:3001, which is the default Mongo URL for a Spree started as above.
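
Schematically, slim's job looks something like the sketch below (an illustration only, not slim's real code; the newline-delimited-JSON framing and the Mongo collection name are assumptions made for this sketch):

var net = require('net');
var MongoClient = require('mongodb').MongoClient;

// Connect to the Mongo that Spree/Meteor started, then accept events on :8123.
MongoClient.connect('mongodb://localhost:3001/meteor', function (err, db) {
  if (err) throw err;
  net.createServer(function (socket) {
    var buffered = '';
    socket.on('data', function (chunk) {
      buffered += chunk;
      var lines = buffered.split('\n');
      buffered = lines.pop();                      // keep any partial line
      lines.forEach(function (line) {
        if (!line.trim()) return;
        var event = JSON.parse(line);
        db.collection('events').insertOne(event);  // slim actually aggregates stats rather than inserting raw events
      });
    });
  }).listen(8123);                                 // slim's default listen port
});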

Run Spark with JsonRelay

If using Spark ≥ 1.5.0, simply pass the following flags to spark-{shell,submit}:

--packages org.hammerlab:spark-json-relay:2.0.0
--conf spark.extraListeners=org.apache.spark.JsonRelay

Otherwise, download a JsonRelay JAR:

wget https://repo1.maven.org/maven2/org/hammerlab/spark-json-relay/2.0.0/spark-json-relay-2.0.0.jar

…then tell Spark to send events to it by passing the following arguments to spark-{shell,submit}:

# Include JsonRelay on the driver's classpath
--driver-class-path /path/to/spark-json-relay-2.0.0.jar
  
# Register your JsonRelay as a SparkListener
--conf spark.extraListeners=org.apache.spark.JsonRelay
  
# Point it at your `slim` instance; default: localhost:8123
--conf spark.slim.host=…
--conf spark.slim.port=…

Comparison to Spark UI

Below is a journey through Spark JIRAs past, present, and future, comparing the current state of Spree with Spark's web UI.

~Fixed JIRAs

I believe the following are resolved or worked around by Spree:

Missing Functionality

Functionality known to be present in the existing Spark web UI / history server and missing from Spree:

Future Nice-to-haves

A motley collection of open Spark-UI JIRAs that might be well-suited for fixing in Spree:

  • SPARK-1622: expose input splits
  • SPARK-1832: better use of warning colors
  • SPARK-2533: summary stats about locality-levels
  • SPARK-3682: call out anomalous/concerning/spiking stats, e.g. heavy spilling.
  • SPARK-3957: distinguish/separate RDD- vs. non-RDD-storage.
  • SPARK-4072: better support for streaming blocks.
  • Control the Spark application / driver from Spree:
  • SPARK-4906: unpersist applications in slim that haven't been heard from in a while.
  • SPARK-7729: display executors' killed/active status.
  • SPARK-8469: page-able viz?
  • Various duration-confusion clarification/bug-fixing:
    • SPARK-8950: "scheduler delay time"-calculation bug
    • SPARK-8778: "scheduler delay" mismatch between event timeline, task list.
  • SPARK-4800: preview/sample RDD elements.

Notes / Implementation Details / FAQ

ECONNREFUSED / MongoError

If you see errors like this when starting slim:

/usr/local/lib/node_modules/slim.js/node_modules/mongodb/lib/server.js:228
        process.nextTick(function() { throw err; })
                                      ^
AssertionError: null == { [MongoError: connect ECONNREFUSED 127.0.0.1:3001]
  name: 'MongoError',
  message: 'connect ECONNREFUSED 127.0.0.1:3001' }

it's likely because you need to start Spree first (by running meteor from the ui subdirectory of this repo).

slim expects to connect to a MongoDB that Spree starts (at localhost:3001 by default).

BYO Mongo

Meteor (hence Spree) spins up its own Mongo instance by default, typically at port 3001.

For a variety of reasons, you may want to point Spree and Slim at a different Mongo instance. The handy ui/start script makes this easy:

$ ui/start -h <mongo host> -p <mongo port> -d <mongo db> --port <meteor port>

Either way, Meteor will print out the URL of the Mongo instance it's using when it starts up, and display it in the top right of all pages, e.g.:

[Screenshot: Spree nav-bar showing the Mongo-instance URL]

Important: for Spree to update in real-time, your Mongo instance needs to have a "replica set" initialized, per this Meteor forum thread.

Meteor's default mongo instance will do this, but otherwise you'll need to set it up yourself. It should be as simple as:

  • adding the --replSet=rs0 flag to your mongod command (where rs0 is a dummy name for the replica set), and
  • running rs.initiate() from a mongo shell connected to that mongod server, as shown below.
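
For example, after starting mongod with --replSet=rs0, from a mongo shell connected to it:

// initialize a minimal, single-member replica set
rs.initiate()
// optionally verify that the member reports itself as PRIMARY
rs.status()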

Now your Spark jobs will write events to the Mongo instance of your choosing, and Spree will display them to you in real-time!

Installing Meteor

Meteor can be installed, per their docs, by running:

curl https://install.meteor.com/ | sh

Installing Spree and Slim sans sudo

Meteor

By default, Meteor will install itself in ~/.meteor and attempt to put an additional helper script at /usr/local/bin/meteor.

It's OK to skip the latter: if/when the installer prompts for your root password, simply ^C out of the script.

Slim

npm install -g slim.js may require superuser privileges; if this is a problem, you can either:

  • Install locally with npm, e.g. in your home directory:
    cd ~
    npm install slim.js
    cd ~/node_modules/slim.js
    ./slim
    
  • Run slim from the sources in this repository:
    cd slim  # from the root of this repository; make sure you cloned with `git clone --recursive`
    npm install
    ./slim
    

More Screencasts

See the screencast gallery in this repo for more GIFs showing Spree in action!

Spark Version Compatibility

Spree has been tested pretty heavily against Spark 1.4.1. It's been tested less heavily, but should Just Work™, on Spark versions back to 1.3.0, when the spark.extraListeners conf option (which JsonRelay uses to register itself with the driver) was added.

Contributing, Reporting Issues

Please file issues if you have any trouble using Spree or its sub-components or have any questions!

See slim's documentation for info about ways to report issues with it.
