• Stars
    star
    125
  • Rank 286,335 (Top 6 %)
  • Language
    Java
  • License
    Apache License 2.0
  • Created almost 7 years ago
  • Updated over 6 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Profiler for large-scale distributed java applications (Spark, Scalding, MapReduce, Hive,...) on YARN.

Babar: a profiler for large-scale distributed applications

Babar is a profiler for distributed applications developed to profile large-scale distributed applications such as Spark, Scalding, MapReduce or Hive programs.

babar

Babar registers metrics about memory, cpu, garbage collection usage, as well as method calls in each individual container and then aggregates them over the entire application to produce a ready-to-use report of the resource usage and method calls (as flame-graphs) of the program..

Currently babar is designed to profile YARN applications, but could be extended in order to profile other types of applications.

Table of contents

Build

The following tools are required to build the project:

  • maven (version 3 or greater)
  • java (JDK) (version 1.8 or greater)

In order to build the project, run the following command at the root of the project:

mvn clean install

Usage

Babar is composed of three main components:

  • babar-agent

    The babar-agent is a java-agent program. An agent is a jar that can be attached to a JVM in order to instrument this JVM. The agent fetches, at regular interval, information on the resource consumption and logs the resulting metrics in a plain text file named babar.log inside the YARN log directory. YARN's log aggregation at the end of the application combines all the executors logs into a single log file available on HDFS.

  • babar-processor

    The babar-processor is the piece of software responsible for parsing the aggregated log file from the YARN application and aggregating the metrics to produce the report. The logs are parsed as streams which allows to aggregate large logs files (dozens of GB).

    Once the babar-processor has run, a report HTML file is generated, containing all the graphs (memory, CPU usage, GC usage, executor counts, flame-graphs,...). This record can easily be shared by teams and saved for later use.

  • babar-report

    The babar-report is a VueJS project used as the template for the HTML file generated by the processor.

Babar-agent

The babar-agent instruments the JVM to register and log the resource usage metrics. It is a standard java-agent component (see the instrumentation API doc for more information).

In order to add the agent to a JVM, add the following arguments to the java command line used to start your application:

 -javaagent:/path/to/babar-agent-0.2.0-SNAPSHOT.jar=StackTraceProfiler,JVMProfiler[reservedMB=1024],ProcFSProfiler

You will need to replace /path/to/babar-agent-0.2.0-SNAPSHOT.jar with the actual path of the agent jar on your system. This jar must be locally accessible to your JVM (i.e. distributed on all your YARN nodes).

The profilers can be set and configured using this command line by adding parameters within brackets as shown in the following example:

-javaagent:./babar-agent-0.2.0-SNAPSHOT.jar=StackTraceProfiler[profilingMs=1000,reportingMs=600000],JVMProfiler[profilingMs=1000,reservedMB=2560],ProcFSProfiler[profilingMs=1000]

The available profilers and their configuration are described below. They can be used together or independently of each other.

JVMProfiler

The JVMProfiler registers metrics related to the resource usage of the JVM such as memory (heap and off-heap), host and JVM CPU usage, and minor and major GC ratio. Because it uses the JVM instrumentation (see the MXBean documentation), it will only work inside the hotspot JVM (the most commonly used) and will not register metrics for processes ran outside the JVM even if started by it.

This profiler accepts the following parameters:

reservedMB (optional) The amount of memory reserved in megabytes for the container in which the JVM runs. This value allows Babar to plot the reserved memory despite not having access to it (as it is managed by the resource allocator, i.e. YARN).
profilingMs (optional) The interval in milliseconds between each sample (default 1000ms).

ProcFSProfiler

The ProcFSProfiler logs OS-level metrics that are retrieved using the proc file system. This profiler is able to get metrics for the entire process tree, including processes started by the JVM but ran outside of it.

Because the proc filesystem is only available on unix-like systems and its implementation is platform dependent, this profiler will only run on linux systems. You may find more information on the proc filesystem in the official man page.

This profiler accepts the following parameters:

profilingMs (optional) The interval in milliseconds between each sample (default 1000ms).

StackTraceProfiler

The StackTraceProfiler profiler registers the stack traces of all RUNNABLE JVM threads at regular intervals in order to build flame graphs of the time spent by all your JVMs in method calls.

The graphs are built by sampling the stack traces of the JVM at a regular interval (the profilingMs options). The traces are logged at another interval (the reportingMs option) in order to aggregate multiple traces before logging them to save space in the logs, which could otherwise takes hundreds of gigabytes. The traces are always logged on the JVM shutdown, so one can set the reporting interval at a very high value in order to save the most space in the logs if they are not interested in having traces logged in case the JVM is abruptly killed (they will still be logged if an exception is raised by the application code but the JVM is allowed to go through the shutdown hooks).

This profiler accepts the following parameters:

profilingMs (optional) The interval in milliseconds between each sample (default 100ms).
reportingMs (optional) The interval in milliseconds before logging the aggregated traces (default 10min).

Babar-processor

The babar-processor is the piece of software that parses the logs and aggregates the metrics into graphs.

The processor aggregates stack traces as flame graphs flame graphs. Other metrics are aggregated using a given time-precision, which means they are aggregated in time-buckets to aggregate metrics of different container in buckets as all containers will log metrics at a different time. This also allows to significantly reduce the size of the resulting HTML file so that it can easily be shared and exploited. This time-precision is user-adjustable but should always be at least twice the profiling interval set in the profilers.

Usage

  1. aggregating the logs

    The processor needs to parse the application log aggregated by YARN (or any other log aggregation mechanism), either from HDFS or from a local log file that has been fetched using the following command (replace the application id with yours):

    yarn logs --applicationId application_1514203639546_124445 > myAppLog.log
    
  2. parsing the logs to generate the HTML report

    To run the babar-processor, the following command can be used:

    java -jar /path/to/babar-processor.jar myAppLog.log
    

    The processor accepts the following arguments:

    -c, --containers  <arg>         if set, only metrics of containers matching these prefixes are aggregated
                                    (comma-separated)
    -d, --max-traces-depth  <arg>   max depth of stack traces
    -r, --min-traces-ratio  <arg>   min ratio of occurences in profiles for traces to be kept
    -o, --output-file  <arg>        path of the output file (default: ./babar_{date}.html)
    -t, --time-precision  <arg>     time precision (in ms) to use in aggregations
    --help                          Show help message
    
    trailing arguments:
    log-file (required)   the log file to open
    

    Upon completion, the report HTML file is generated with a name such as babar_2018-04-29_12-04-12.html. This file contains all the aggregated measurements.

Profiling a Spark application

No code changes are required to instrument a Spark job since Spark allows to distribute the agent jar archive to all containers using the --files command argument.

In order to instrument your Spark application, simply add these arguments to your spark-submit command:

--files ./babar-agent-0.2.0-SNAPSHOT.jar
--conf spark.executor.extraJavaOptions="-javaagent:./babar-agent-0.2.0-SNAPSHOT.jar=StackTraceProfiler,JVMProfiler[reservedMB=2560],ProcFSProfiler"

You can adjust the reserved memory setting (reservedMB) according to spark.executor.memory + spark.yarn.executor.memoryOverhead.

You can then use the yarn logs command to get the aggregated log file and process the logs using the babar-processor.

Profiling a Scalding or MapReduce application

No code change is required to instrument a MapReduce, Cascading or Scalding application as MapReduce allows distributing jars to the containers from HDFS using the -files argument.

In order to instrument your MapReduce job, use the following arguments:

-files hdfs:///path/to/babar-agent-0.2.0-SNAPSHOT.jar \
-Dmapreduce.map.java.opts="-javaagent:./babar-agent-0.2.0-SNAPSHOT.jar=StackTraceProfiler,JVMProfiler[reservedMB=2560],ProcFSProfiler" \
-Dmapreduce.reduce.java.opts="-javaagent:./babar-agent-0.2.0-SNAPSHOT.jar=StackTraceProfiler,JVMProfiler[reservedMB=3584],ProcFSProfiler"

You can adjust the reserved memory values for mappers and reducers independently. These values can also be automatically set by instrumenting the jobs programmatically.

Profiling a Hive application

Similarly to Spark and MapReduce, Hive allows to easily distribute a jar from HDFS to the executors. To profile an Hive application, simply execute the following commands in Hive before your query:

ADD FILE /path/to/babar-agent-0.2.0-SNAPSHOT.jar;
SET mapreduce.map.java.opts="-javaagent:./babar-agent-0.2.0-SNAPSHOT.jar=StackTraceProfiler,JVMProfiler[reservedMB=2560],ProcFSProfiler";
SET mapreduce.reduce.java.opts="-javaagent:./babar-agent-0.2.0-SNAPSHOT.jar=StackTraceProfiler,JVMProfiler[reservedMB=3584],ProcFSProfiler";

As for other MapReduce applications, reserved memory values will need to be adjusted for mappers and reducers independently.

License

Copyright 2018 Criteo

Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

More Repositories

1

autofaiss

Automatically create Faiss knn indices with the most optimal similarity search parameters.
Python
811
star
2

cassandra_exporter

Apache Cassandra® metrics exporter for Prometheus
Java
169
star
3

biggraphite

Simple Scalable Time Series Database
Python
130
star
4

kerberos-docker

Run kerberos environment in docker containers
Shell
124
star
5

cuttle

An embedded job scheduler.
Scala
114
star
6

kafka-sharp

A C# Kafka driver
C#
110
star
7

lolhttp

An HTTP Server and Client library for Scala.
Scala
91
star
8

tf-yarn

Train TensorFlow models on YARN in just a few lines of code!
Python
86
star
9

consul-templaterb

consul-template-like with erb (ruby) template expressiveness
Ruby
77
star
10

Spark-RSVD

Randomized SVD of large sparse matrices on Spark
Scala
77
star
11

JVips

Java wrapper for libvips using JNI.
Java
72
star
12

deepr

The deepr module provide abstractions (layers, readers, prepro, metrics, config) to help build tensorflow models on top of tf estimators
Python
50
star
13

cluster-pack

A library on top of either pex or conda-pack to make your Python code easily available on a cluster
Python
45
star
14

findjars

Gradle plugin to debug classpath issues
Kotlin
45
star
15

py-consul

Python client for Consul (http://www.consul.io/)
Python
44
star
16

kafka-ganglia

Kafka Ganglia Metrics Reporter
Java
39
star
17

garmadon

Java event logs collector for hadoop and frameworks
Java
39
star
18

graphite-remote-adapter

Fully featured graphite remote adapter for Prometheus
Go
38
star
19

command-launcher

A command launcher 🚀 made with ❤️
Go
36
star
20

marathon_exporter

A Prometheus metrics exporter for the Marathon Mesos framework
Go
34
star
21

haproxy-spoe-auth

Plugin for authorizing users against LDAP
Go
33
star
22

netprobify

Network probing tool crafted for datacenters (but not only)
Python
31
star
23

vizsql

Scala and SQL happy together.
Scala
28
star
24

haproxy-spoe-go

An implementation of the SPOP protocol in Go. https://www.haproxy.org/download/2.0/doc/SPOE.txt
Go
28
star
25

CriteoDisplayCTR-TFOnSpark

Python
27
star
26

netcompare

Python
26
star
27

loop

enhance your web application development workflow
JavaScript
26
star
28

openapi-comparator

C#
25
star
29

fromconfig

A library to instantiate any Python object from configuration files.
Python
24
star
30

vertica-hyperloglog

C++
22
star
31

slab

An extensible Scala framework for creating monitoring dashboards.
Scala
22
star
32

consul-bench

A tool to bench Consul Clusters
Go
20
star
33

socco

A Scala compiler plugin to generate documentation from Scala source files.
Scala
20
star
34

hwbench

hwbench is a benchmark orchestration tool to automate the low-level testing of servers.
Python
20
star
35

mesos-term

Web terminal and sandbox explorer for your mesos containers
TypeScript
19
star
36

memcache-driver

Criteo's .NET MemCache driver
C#
16
star
37

NinjaTurtlesMutation

C#
16
star
38

defcon

DefCon - Status page and API for production status
Python
16
star
39

vagrant-winrm

Vagrant 1.6+ plugin extending WinRM communication features
Ruby
16
star
40

criteo-python-marketing-sdk

Official Python SDK to access the Criteo Marketing API
Python
15
star
41

mesos-external-container-logger

Mesos container logger module for logging to processes, backported from MESOS-6003
C++
13
star
42

android-publisher-sdk

Criteo Publisher SDK for Android
Java
12
star
43

lobster

Simple loop job runner
Ruby
12
star
44

ocserv-exporter

ocserv exporter for Prometheus
Go
11
star
45

berilia

Create hadoop cluster in aws ec2 for development
Scala
11
star
46

mlflow-yarn

Backend implementation for running MLFlow projects on Hadoop/YARN.
Python
10
star
47

openpass

TypeScript
10
star
48

ios-publisher-sdk

Criteo Publisher SDK for iOS
Objective-C
10
star
49

traffic-mirroring

Go
8
star
50

ipam-client

Python ipam-client library
Python
7
star
51

eslint-plugin-criteo

JavaScript
7
star
52

tableau-parser

Scala
7
star
53

gourde

Flask sugar for Python microservices
Python
7
star
54

criteo-java-marketing-sdk

Official Java SDK to access the Criteo Marketing API
Java
7
star
55

metrics-net

Archived: Capturing CLR and application-level metrics. So you know what's going on.
C#
6
star
56

casspoke

Prometheus probe exporter for Cassandra latency and availability
Java
6
star
57

newman-server

A simple webserver to run Postman collections using the newman engine
TypeScript
6
star
58

mewpoke

Memcached / couchbase probe
Java
6
star
59

je-code-crazy-filters

Python
6
star
60

http-proxy-exporter

Expose proxy performance statistics in a Prometheus-friendly way.
Go
5
star
61

kitchen-transport-speedy

Speed up kitchen file transfer using archives
Ruby
5
star
62

AFK

5
star
63

vertica-datasketch

C++
5
star
64

django-memcached-consul

Used consul discovered memcached servers
Python
4
star
65

skydive-visualizer

Go
4
star
66

log4j-jndi-jar-detector

Application trying to detect processes vulnerable to log4j JNDI exploit
Go
4
star
67

criteo-api-python-sdk

Python
4
star
68

RabbitMQHare

High-level RabbitMQ C# client
C#
4
star
69

hardware-manifesto

Criteo's hardware operating principles manifest
4
star
70

automerge-plugin

Gerrit plugin to automatically merge reviews
Java
4
star
71

cassback

This project aims to backup Cassandra SSTables and store them into HDFS
Ruby
4
star
72

vertica-hll-druid

C++
3
star
73

fromconfig-mlflow

A fromconfig Launcher for MlFlow
Python
3
star
74

hive-client

A Pure Scala/Thrift Hive Client
Thrift
3
star
75

android-events-sdk

Java
3
star
76

rundeck-dsl

Groovy
3
star
77

tableau-maven-plugin

Java
3
star
78

ml-hadoop-experiment

Python
3
star
79

android-publisher-sdk-examples

Java
3
star
80

vault-auth-plugin-chef

Go
3
star
81

mesos-command-modules

Mesos modules running external commands
C++
3
star
82

sonic-saltstack

Saltstack modules for SONiC
Python
3
star
83

scala-schemas

use scala classes as schema definition across different systems
Scala
3
star
84

tf-collective-all-reduce

Lightweight framework for distributed TensorFlow training based on dmlc/rabit
Python
3
star
85

criteo-marketing-sdk-generator

A Gradle project to generate custom SDKs for Criteo's marketing API
Mustache
3
star
86

s3-probe

Go
3
star
87

blackbox-prober

Go
3
star
88

netbox-network-cmdb

Python
3
star
89

kitchen-vagrant_winrm

A test-kitchen driver using vagrant-winrm
Ruby
2
star
90

graphite-dashboard-api

Graphite Dashboard API
Ruby
2
star
91

nrpe_exporter

Go
2
star
92

criteo-dotnet-blog

C#
2
star
93

criteo-java-marketing-transition-sdk

Java
2
star
94

pgwrr

Python
2
star
95

criteo-python-marketing-transition-sdk

Python
2
star
96

privacy

2
star
97

knife-ssh-agent

Authenticate to a chef server using a SSH agent
Ruby
2
star
98

carbonate-utils

Utilities for carbonate - resync whisper easilly
Python
2
star
99

ios-events-sdk

Objective-C
2
star
100

gtm-criteo-useridentification

Smarty
2
star