  • Stars: 356
  • Rank: 116,958 (Top 3%)
  • Language: Scala
  • License: Apache License 2.0
  • Created: over 7 years ago
  • Updated: over 1 year ago

Repository Details

A tool for monitoring and tuning Spark jobs for efficiency.

Sparklint

The missing Spark performance debugger that can be dragged and dropped into your Spark application!

Featured in Spark Summit 2016 EU (Introduction | Slides | Video) and Spark Summit 2017 SF (Introduction | Slides | Video).

Mission

  • Provide advanced metrics and better visualization of your Spark application's resource utilization
    • Vital application lifetime stats like idle time, average vcore usage, and distribution of locality
    • View task allocation by executor
    • VCore usage graphs by FAIR scheduler group
    • VCore usage graphs by task locality
    • (WIP) Find RDDs that can benefit from persisting
  • Help you find out where the bottlenecks are
    • (WIP) Automated report on application bottlenecks
  • (WIP) Opportunity to give the running application real-time hints on magic numbers like partition size, job submission parallelism, and whether or not to persist an RDD

Usage

First, figure out your current Spark version. Refer to project/BuildUtils.scala and look for SUPPORTED_SPARK_VERSIONS; these are the versions we support out of the box.

If your Spark version has a precompiled Sparklint jar (say, 1.6.1 on Scala 2.10), the artifact name will be sparklint-spark161_2.10. Note that in the artifact name, 161 means Spark 1.6.1 and 2.10 means Scala 2.10.

If your Spark version is not precompiled (e.g. 1.5.0), you can add an entry in project/BuildUtils.getSparkMajorVersion, then provide compatible code similar to the spark-1.6 sources in src/main/spark-1.5.
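
A minimal, hypothetical sketch of the kind of mapping getSparkMajorVersion performs; the real function in project/BuildUtils.scala may differ, and the version branches below are illustrative only:

// Hypothetical sketch: map a full Spark version to the source-directory suffix under src/main/.
// The actual implementation in project/BuildUtils.scala may look different.
def getSparkMajorVersion(sparkVersion: String): String = sparkVersion match {
  case v if v.startsWith("1.5") => "spark-1.5" // new entry, backed by code in src/main/spark-1.5
  case v if v.startsWith("1.6") => "spark-1.6"
  case v if v.startsWith("2.0") => "spark-2.0"
  case other => throw new IllegalArgumentException(s"Unsupported Spark version: $other")
}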

For more details about event logging, how to enable it, and how to gather log files, see http://spark.apache.org/docs/latest/configuration.html#spark-ui
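
The linked page covers the details; as a quick reference, here is a minimal sketch of enabling event logging programmatically using the standard Spark settings spark.eventLog.enabled and spark.eventLog.dir (the log directory below is a placeholder):

// Minimal sketch: enable Spark event logging so there are logs for Sparklint's server mode to read.
val conf = new org.apache.spark.SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-event-logs") // placeholder path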

Live mode (run inside spark driver node)

SparklintListener is an implementation of SparkFirehoseListener that listens to Spark events while the application is running. To enable it, try one of the following:

  1. Upload the packaged jar to your cluster and include it in the classpath directly
  2. Use the --packages option to inject the dependency during job submission if a precompiled jar exists, e.g. --conf spark.extraListeners=com.groupon.sparklint.SparklintListener --packages com.groupon.sparklint:sparklint-spark201_2.11:1.0.8
  3. Add the dependency directly to your pom, repackage your application, then during job submission use --conf spark.extraListeners=com.groupon.sparklint.SparklintListener

Finally, find your Spark application's driver node address, open a browser, and visit port 23763 (our default port) on the driver node.

Add the dependency directly in pom.xml:

<dependency>
    <groupId>com.groupon.sparklint</groupId>
    <artifactId>sparklint-spark201_2.11</artifactId>
    <version>1.0.12</version>
</dependency>

or in build.sbt:

libraryDependencies += "com.groupon.sparklint" %% "sparklint-spark201" % "1.0.12"
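
Once the dependency is on the classpath, the listener can also be registered programmatically instead of via the --conf flag; a minimal sketch, where the app name and job code are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: register SparklintListener through SparkConf rather than on the spark-submit command line.
val conf = new SparkConf()
  .setAppName("my-app") // placeholder
  .set("spark.extraListeners", "com.groupon.sparklint.SparklintListener")

val sc = new SparkContext(conf)
// ... run your job; Sparklint serves its UI from the driver, on port 23763 by default
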
Server mode (run on local machine)

SparklintServer can run on your local machine. It will read Spark event logs from the location specified. You can feed Sparklint an event log file to play back activity.

  • Check out the repo
  • Make sure you have SBT installed
  • Execute sbt run to start the server. You can add a directory, log file, or a remote history server via the UI
    • You can also load a directory of log files on startup, like sbt "run -d /path/to/log/dir -r"
    • Or analyze a single log file on startup, like sbt "run -f /path/to/logfile -r"
    • Or connect to a history server on startup, like sbt "run --historyServer http://path/to/server -r"
  • Then open a browser and navigate to http://localhost:23763
  • The Spark version doesn't matter in server mode

Server mode (Docker)

  • Docker support is available at https://hub.docker.com/r/roboxue/sparklint/
  • Pull the docker image from Docker Hub or build it locally with sbt docker
    • The sbt docker command builds a roboxue/sparklint:latest image on your local machine
  • This docker image essentially wraps sbt run for you
    • Attach a directory that contains logs to the image as a volume, so that you can use the -f or -d options
    • Or just start the docker image and connect to a history server via the UI
  • Basic commands to run the docker image:
    • docker run -v /path/to/logs/dir:/logs -p 23763:23763 roboxue/sparklint -d /logs && open localhost:23763
    • docker run -v /path/to/logs/file:/logfile -p 23763:23763 roboxue/sparklint -f /logfile && open localhost:23763

Config

  • Common config
    • Set the port of the UI (e.g., 4242)
      • In live mode, pass --conf spark.sparklint.port=4242 to the spark-submit script
      • In server mode, pass --port 4242 as a command-line argument to sbt run
  • Server-only config
    • -f [FileName]: File name of a Spark event log to use as a source
    • -d [DirectoryName]: Directory of Spark event logs to use as sources, read in filename sort order
    • --historyServer [Address]: Address of a Spark history server to use as an event source (see the startup example above)
    • -p [pollRate]: The interval (in seconds) between polling for changes in directory and history-server event sources
    • -r: Set this flag to run each buffer through to its end state on startup
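
For live mode, the port can also be set on the SparkConf rather than on the spark-submit command line; a minimal sketch reusing the spark.sparklint.port key from the list above:

// Minimal sketch: programmatic equivalent of --conf spark.sparklint.port=4242.
// Assumes SparklintListener is registered as shown in the live-mode example earlier.
val conf = new org.apache.spark.SparkConf()
  .set("spark.extraListeners", "com.groupon.sparklint.SparklintListener")
  .set("spark.sparklint.port", "4242") // the UI will listen on 4242 instead of 23763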

Developer cheatsheet

  • First enter the sbt console: sbt
  • Test: test
  • Cross-Scala-version test: + test
  • Rerun failed tests: testQuick
  • Change Spark version: set sparkVersion := "2.0.0"
  • Change Scala version: ++ 2.11.8
  • Package: package
  • Perform a task (e.g., test) for each Spark version: + foreachSparkVersion test
  • Publish to Sonatype staging: + foreachSparkVersion publishSigned
  • Build the docker image locally: docker
    • A snapshot version will be tagged as latest
    • A release version will be tagged as the version number
  • Publish an existing docker image: dockerPublish
  • Build and publish the docker image at the same time: dockerBuildAndPush
  • The commands to release everything:
sbt release # github branch merging
git checkout master
sbt sparklintRelease sonatypeReleaseAll

Change log

1.0.12
  • Addressed an https issue (#76 @rluta)
1.0.11
  • Addressed a port config bug (#74 @jahubba)
1.0.10
  • Addressed a port config bug (#72 @cvaliente)
1.0.9
  • Added cross compile for Spark 2.2.0, 2.2.1
1.0.8
  • Fixed a compatibility issue with the Spark 2.0+ history server API (@alexnikitchuk, @neggert)
  • Fixed a dependency issue in the docker image (@jcdauchy)
1.0.7
  • Graphs now update over WebSocket, so a manual refresh is rarely needed
1.0.6
  • Supports breaking down core usage by FAIR scheduler pool

More Repositories

  1. greenscreen (CoffeeScript, 1,197 stars)
  2. Selenium-Grid-Extras (Ruby, 538 stars): Simplify the management of the Selenium Grid Nodes and stabilize said nodes by cleaning up the test environment after the build has been completed
  3. DotCi (Java, 500 stars): DotCi Jenkins github integration, .ci.yml http://groupon.github.io/DotCi
  4. grox (Java, 339 stars): Grox helps to maintain the state of Java / Android apps.
  5. testium (CoffeeScript, 305 stars): ⛔️ [DEPRECATED] see https://github.com/testiumjs/testium-mocha
  6. gleemail (CoffeeScript, 293 stars): Making email template development fun! Sort of!
  7. ansible-silo (Shell, 203 stars): Ansible in a self-contained environment via Docker.
  8. ndu (JavaScript, 195 stars): node disk usage
  9. odo (Java, 151 stars): A Mock Proxy Server
  10. spark-metrics (Scala, 145 stars): A library to expose more of Apache Spark's metrics system
  11. cson-parser (JavaScript, 132 stars): Simple & safe CSON parser
  12. FeatureAdapter (Java, 113 stars): FeatureAdapter (FA) is an Android Library providing an optimized way to display complex screens on Android.
  13. dependency-injection-checks (Java, 97 stars): Dependency Injection Usage Checks
  14. node-cached (JavaScript, 94 stars): A simple caching library for node.js, inspired by the Play cache API
  15. luigi-warehouse (Python, 87 stars): A luigi powered analytics / warehouse stack
  16. codeburner (Ruby, 83 stars): Security-focused static code analysis for everyone
  17. gofer (JavaScript, 83 stars): A general purpose service client library for node.js
  18. locality-uuid.java (Java, 80 stars)
  19. jesos (Java, 51 stars)
  20. swagql (JavaScript, 46 stars): Create a GraphQL schema from swagger spec
  21. quinn (JavaScript, 40 stars): A set of convenient helpers to use promises to handle http requests
  22. robo-remote (Java, 40 stars): RoboRemote is a remote control framework for Robotium. The goal of RoboRemote is to allow for more complex test scenarios by letting the automator write their tests using standard desktop Java/JUnit. All of the Robotium Solo commands are available. RoboRemote also provides some convenience classes to assist in common tasks such as interacting with list views.
  23. webdriver-http-sync (JavaScript, 39 stars): sync http implementation of the WebDriver protocol for Node.js
  24. mysql_slowlogd (Shell, 36 stars): Daemon that serves MySQL's slow query log via HTTP as a streaming download
  25. mongo-deep-mapreduce (Java, 30 stars): Use Hadoop MapReduce directly on Mongo data
  26. tdsql (Perl, 29 stars): Run SQL queries against a Teradata data warehouse server
  27. nlm (JavaScript, 29 stars): Lifecycle manager for node projects
  28. selenium-download (JavaScript, 29 stars): allow downloading of latest selenium standalone server and chromedriver
  29. monsoon (Java, 28 stars): An extensible monitor system that checks java processes and exposes metrics based on them.
  30. backbeat (Ruby, 28 stars): A workflow service for processing asynchronous tasks across distributed systems
  31. Message-Bus (Java, 25 stars)
  32. retromock (Java, 24 stars): Like Wiremock for Retrofit, but faster.
  33. report-card (JavaScript, 23 stars): An Open Source Report Card
  34. nakala (Java, 22 stars)
  35. assertive (JavaScript, 21 stars): Assertive is a terse yet expressive assertion library
  36. locality-uuid.rb (Ruby, 18 stars)
  37. javascript (JavaScript, 16 stars): Guidelines for using Javascript at Groupon
  38. KatMaps (Kotlin, 15 stars)
  39. shellot (Ruby, 14 stars): Slim terminal realtime graphing tool
  40. roll (C, 13 stars): roll - bootstrap or upgrade a Unix host with Roller
  41. sycl (Ruby, 13 stars): Simple YAML Config Library
  42. vertx-utils (Java, 12 stars)
  43. baryon (Scala, 11 stars): Baryon is a library for building Spark Streaming applications that consume data from Kafka.
  44. params_deserializers (Ruby, 11 stars): Deserializers for Rails params
  45. poller (Ruby, 10 stars): Poll a URL, and trigger code on changes
  46. git-workflow (JavaScript, 10 stars)
  47. json-schema-validator (Java, 10 stars): Maven plugin to validate json files against a json schema. Uses https://github.com/fge/json-schema-validator library under the covers
  48. mysql-junit4 (Java, 9 stars)
  49. vertx-memcache (Java, 9 stars)
  50. shared-store (JavaScript, 9 stars): Keeping config data in sync
  51. artemisia (Scala, 8 stars): A light-weight configuration driven Data-Integration utility
  52. pg_consul (C++, 8 stars)
  53. vertx-redis (Java, 7 stars)
  54. phy (JavaScript, 6 stars): Minimal hyperscript helpers for Preact
  55. mezzanine (Scala, 6 stars): Mezzanine is a library built on Spark Streaming used to consume data from Kafka and store it into Hadoop.
  56. DotCi-Plugins-Starter-Pack (Java, 6 stars): Expansion-pack for DotCi
  57. Novie (Java, 5 stars)
  58. backbeat_ruby (Ruby, 4 stars): A Ruby client for Backbeat workflow service
  59. nilo (JavaScript, 3 stars): A dependency injection toolset for building applications
  60. promise (Java, 3 stars)
  61. schema-inferer (Scala, 2 stars)
  62. two-to-three (Java, 2 stars): Swagger to Open API Converter
  63. assertive-as-promised (CoffeeScript, 2 stars): extends assertive with promise support
  64. jtier-ctx (Java, 2 stars)
  65. kmond (Kotlin, 2 stars)
  66. api-build-resources (2 stars): Build related resources files, e.g. checkstyle configs, etc.
  67. tiquette (TypeScript, 2 stars): Have some etiquette. Format your commit messages with a ticket or issue number.
  68. gofer-proxy (JavaScript, 1 star): Use a `gofer` client as an express middleware
  69. gh-grep (TypeScript, 1 star): GitHub CLI grep extension
  70. stylint-config-groupon (CSS, 1 star)
  71. coffeelint-config-groupon (JavaScript, 1 star): CoffeeScript lint setting used at Groupon
  72. installed-package (JavaScript, 1 star): Run your node tests against an installed version of your package
  73. api-parent-pom (1 star): Project to contain parent pom for common plugin configuration across all API team Maven projects.
  74. jdbi-st4 (1 star)
  75. gh-bulk-pr (TypeScript, 1 star): GitHub CLI bulk-pr extension