• This repository has been archived on 30/Oct/2020

White Elephant

White Elephant is a Hadoop log aggregator and dashboard which enables visualization of Hadoop cluster utilization across users.

Quick Start

To try out the server with some test data:

cd server
ant
./startup.sh

Then visit http://localhost:3000. It may take a minute for the test data to load.

Hadoop Version Compatibility

White Elephant is compiled and tested against Hadoop 1.0.3 and should work with any 1.0.x version. Hadoop 2.0 is not yet supported.

Server

The server is a JRuby web application. In a production environment it can be deployed to Tomcat, where it reads aggregated usage data directly from Hadoop. This data is loaded into an in-memory database provided by HyperSQL. Charting is provided by Rickshaw.

Getting started

To get started using the server, first set up the environment:

cd server
ant

The default target does several things, among them:

  • Installs JRuby to a local directory under .rbenv
  • Installs Ruby gems to the above directory
  • Downloads JARs
  • Creates test data under data/usage

At this point you should be able to start the server:

./startup.sh

You can now visit http://localhost:3000. It may take a minute for the test data to load.

This uses trinidad to run the JRuby web app in development mode. Since it is in development mode the app assumes local data should be used, which it looks for in the directory specified in config.yml.

Configuration

The server configuration is contained in config.yml. You can see a sample in sample_config.yml.

When run in development mode using ./startup.sh, sample_config.yml is used and it follows the settings specified under local. The only configurable parameter here is file_pattern, which specifies where to load the usage data from on local disk.

When packaged as a WAR, the server runs in production mode and uses the configuration specified under hadoop, the assumption being that the aggregated usage data will be available there. The following parameters must be specified:

  • file_pattern: Glob pattern to load usage files from Hadoop
  • libs: Directories containing Hadoop JARs (to be added to the classpath)
  • conf_dir: Directory containing Hadoop configuration (to be added to the classpath)
  • principal: User name used to access secure Hadoop
  • keytab: Path to keytab file for user to access secure Hadoop

White Elephant does not assume a specific version of Hadoop, so the JARs are not packaged in the WAR. Therefore the path to the Hadoop JARs must be specified in the configuration.
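To make the two configuration sections concrete, here is a minimal sketch of what a config file might look like, written as a shell function that prints it. Only the section names (local, hadoop) and the parameter names come from the description above. The exact nesting, the list form of libs, and every path and principal shown are assumptions for illustration; sample_config.yml in the repository remains the reference.

```shell
# Prints a hypothetical White Elephant server config to stdout.
# Assumptions: the nesting under "local" and "hadoop", the list form of
# "libs", and every path and principal below are illustrative only.
write_sample_config() {
  cat <<'EOF'
local:
  file_pattern: data/usage/*.avro
hadoop:
  file_pattern: /data/hadoop/usage/*.avro
  libs:
    - /opt/hadoop/lib
  conf_dir: /etc/hadoop/conf
  principal: whiteelephant
  keytab: /etc/security/keytabs/whiteelephant.keytab
EOF
}

write_sample_config
```

Redirecting the output to a file (for example `write_sample_config > my_config.yml`) gives a starting point that can then be edited to match a real cluster.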

Deploying

To build a WAR which can be deployed to Tomcat:

ant war -Dconfig.path=<path-to-config>

The config file you specify will be packaged as config.yml within the WAR. See sample_config.yml as an example for how to write the config file.

Hadoop Log Uploading

The script hadoop/scripts/statsupload.pl can be used to upload the Hadoop logs to HDFS so they can be processed. Check its documentation for details.

Hadoop Jobs

There are three Hadoop jobs, all managed by a job executor which keeps track of what work needs to be done.

The first two jobs parse and convert raw job configurations and logging into an easier-to-work-with Avro format. Together, these two datasets can serve as the base data for a variety of usage analytics workflows.

The third and final job reads the Avro-fied log data and aggregates it per hour, writing the data out in Avro format. It essentially builds a data cube which can be easily loaded by the web application into the DB and queried against.
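The idea behind the hourly aggregation can be illustrated with a small sketch. This is not the actual job, which runs on Hadoop over Avro data; it only shows the technique of truncating each timestamp to its hour and summing a metric per user and hour. The three-column input format (user, epoch seconds, value) is a hypothetical schema chosen for the example.

```shell
# Sums a usage metric per user and per hour.
# Input (hypothetical schema): "<user> <epoch-seconds> <value>" per line.
# Output: "<user> <hour-start-epoch> <total>", sorted.
aggregate_hourly() {
  awk '{
         hour = int($2 / 3600) * 3600   # truncate the timestamp to its hour
         total[$1 " " hour] += $3
       }
       END { for (key in total) print key, total[key] }' | LC_ALL=C sort
}

printf '%s\n' \
  'alice 3600 10' \
  'alice 5400 5' \
  'alice 7200 2' \
  'bob 3700 7' | aggregate_hourly
# alice 3600 15
# alice 7200 2
# bob 3600 7
```

Each output row corresponds to one cell of the cube (user, hour), which is the kind of pre-aggregated record a dashboard can load and query cheaply.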

Configuration

Some sample configuration files can be found under hadoop/config/jobs:

  • base.properties: Contains most of the configuration
  • white-elephant-full-usage.job: Job file used when processing all logs.
  • white-elephant-incremental-usage.job: Job file used when incrementally processing logs.

The base.properties file contains configuration specific to White Elephant as well as configuration for Hadoop. All Hadoop configuration parameters begin with hadoop-conf. The two job files each have a single setting, incremental, and differ only in the value they use for it.
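As a quick way to see which settings in a properties file are passed to Hadoop and which belong to White Elephant, the prefix convention can be used as a filter. The sketch below relies only on the documented hadoop-conf prefix; the helper names and the sample file contents (including the queue name property) are hypothetical.

```shell
# Lists the Hadoop parameters in a properties file (lines starting with
# the documented "hadoop-conf" prefix).
hadoop_params() { grep '^hadoop-conf' "$1"; }

# Lists everything else, skipping comments and blank lines.
app_params() { grep -v -e '^hadoop-conf' -e '^#' -e '^$' "$1"; }

# Hypothetical base.properties contents, for illustration only.
props=$(mktemp)
cat > "$props" <<'EOF'
# White Elephant settings
logs.root=/data/hadoop/logs
hadoop-conf.mapred.job.queue.name=default
EOF

hadoop_params "$props"   # prints the hadoop-conf line
app_params "$props"      # prints the logs.root line
rm -f "$props"
```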

Hadoop Logs

The base.properties file contains a parameter named logs.root. This is the root path under which the Hadoop logs to be parsed are found. The parsing job assumes the logs are stored in Hadoop under daily directories using the following directory structure:

<logs.root>/<cluster-name>/daily/<yyyy>/<MMdd>

For example, logs on January 23rd, 2013 for the production cluster may be stored in a directory such as:

/data/hadoop/logs/prod/daily/2013/0123
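If you upload logs with your own tooling, the target directory can be derived from the date. The following sketch builds the path from the layout above; the function name and its argument order are hypothetical, and the date is expected in YYYY-MM-DD form.

```shell
# Builds <logs.root>/<cluster-name>/daily/<yyyy>/<MMdd> for a given day.
# Arguments: logs root, cluster name, date as YYYY-MM-DD.
# The body runs in a subshell so its variables do not leak.
daily_log_dir() (
  logs_root=${1%/}                                  # drop a trailing slash
  cluster=$2
  yyyy=$(printf '%s' "$3" | cut -c1-4)              # e.g. 2013
  mmdd=$(printf '%s' "$3" | cut -c6-7,9-10)         # e.g. 0123
  printf '%s/%s/daily/%s/%s\n' "$logs_root" "$cluster" "$yyyy" "$mmdd"
)

daily_log_dir /data/hadoop/logs prod 2013-01-23
# /data/hadoop/logs/prod/daily/2013/0123
```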

Packaging

To create a zip package containing all files necessary to run the jobs, run:

ant zip -Djob.config.dir=<path-to-job-config-dir>

The job.config.dir should be the directory containing the .properties and .job files you would like to include in the package.

If you use Azkaban as your job scheduler, this zip file will work with it as long as you add the Azkaban-specific configuration to base.properties.

Running

After unzipping the zip package, you can run the jobs using the run.sh script. This requires a couple of environment variables to be set:

  • HADOOP_CONF_DIR: Hadoop configuration directory
  • HADOOP_LIB_DIR: Hadoop JARs directory

To run the full job:

./run.sh white-elephant-full-usage.job

To run the incremental job:

./run.sh white-elephant-incremental-usage.job

The incremental job is more efficient as it only processes new data. The full job reprocesses everything.
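Since run.sh depends on both environment variables, it can help to verify them before starting a job. The check below is a hypothetical helper, not part of the package; it assumes both variables should point at existing directories, as their descriptions suggest.

```shell
# Hypothetical pre-flight check: succeeds only if both variables that
# run.sh needs are set and point at existing directories.
check_hadoop_env() {
  for name in HADOOP_CONF_DIR HADOOP_LIB_DIR; do
    eval "dir=\${$name:-}"            # read the variable named by $name
    if [ -z "$dir" ]; then
      echo "$name is not set" >&2
      return 1
    fi
    if [ ! -d "$dir" ]; then
      echo "$name does not point to a directory: $dir" >&2
      return 1
    fi
  done
}

# Example use from the unzipped package directory:
#   check_hadoop_env && ./run.sh white-elephant-incremental-usage.job
```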

Contributing

White Elephant is open source and freely available under the Apache 2 license. As always, we welcome contributors, so send us your pull requests.

For help please see the discussion group.

Thanks

White Elephant is built using a suite of great open source projects.

License

Copyright 2012 LinkedIn, Inc

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
