The fast and fun way to write YARN applications.

Kitten: For Developers Who Like Playing with YARN

Introduction

Kitten is a set of tools for writing and running applications on YARN, the general-purpose resource scheduling framework that ships with Hadoop 2.2.0. Kitten handles the boilerplate around configuring and launching YARN containers, allowing developers to easily deploy distributed applications that run under YARN.

This link provides a useful overview of what is required to create a new YARN application, and should also help you understand the motivation for creating Kitten.

http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html

Build and Installation

To build Kitten, run:

mvn clean install

from this directory. That will build the common, master, and client subprojects.

The java/examples/distshell directory contains an example configuration file that can be used to run the Kitten version of the Distributed Shell example application that ships with Hadoop 2.2.0. To run the example, execute:

hadoop jar kitten-client-0.2.0-jar-with-dependencies.jar distshell.lua distshell

where the jar file is in the java/client/target directory. You should also copy the application master jar file from java/master/target to a directory where it can be referenced from distshell.lua.
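
For example, assuming the master artifact follows the same naming convention as the client jar (the exact jar name and destination are assumptions; adjust them to match your build and the paths referenced in distshell.lua):

cp java/master/target/kitten-master-0.2.0-jar-with-dependencies.jar <directory referenced in distshell.lua>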

Using Kitten

Kitten aims to handle the boilerplate aspects of configuring and launching YARN applications, allowing developers to focus on the logic of their application and not the mechanics of how to deploy it on a Hadoop cluster. It provides two related components that simplify common YARN usage patterns:

  1. A configuration language, based on Lua 5.1, that is used to specify the resources the application needs from the cluster in order to run.
  2. A pair of Guava services, one for the client and one for the application master, that handle all of the RPCs executed during the lifecycle of a YARN application.

Configuration Language

Kitten makes extensive use of Lua's table type to organize information about how a YARN application should be executed. Lua tables combine aspects of arrays and dictionaries into a single data structure:

a = { "this", "is", "a", "lua", "table" }

b = {
  ["and"] = "so",  -- 'and' is a Lua keyword, so it requires the bracket syntax
  is = "this",
  ["a.key.with.dots"] = "is allowed using special syntax"
}
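
Array-style entries are accessed with 1-based integer indices, and dictionary-style entries are accessed by key:

print(a[1])                  --> this
print(b.is)                  --> this
print(b["a.key.with.dots"])  --> is allowed using special syntax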

The yarn Function

The yarn function checks that a Lua table describing a YARN application contains all of the information Kitten requires in order to launch the application, and it provides some convenience functions that minimize how often configuration information needs to be repeated. Here is how yarn is used to define the distributed shell application:

distshell = yarn {
  name = "Distributed Shell",
  timeout = 10000,
  memory = 512,

  master = {
    env = base_env, -- Defined elsewhere in the file
    command = {
      base = "java -Xmx128m com.cloudera.kitten.appmaster.ApplicationMaster",
      args = { "-conf job.xml" }, -- job.xml contains the client configuration info.
    }
  },

  container = {
    instances = 3,
    env = base_env,  -- Defined elsewhere in the file
    command = "echo 'Hello World!' >> /tmp/hello_world"
  }
}

The yarn function checks for the following fields in the table that is passed to it, filling in default values for optional fields that were not specified.

  1. name (string, required): The name of this application.
  2. timeout (integer, defaults to -1): How long the client should wait in milliseconds before killing the application due to a timeout. If < 0, then the client will wait forever.
  3. user (string, defaults to the user executing the client): The user to execute the application as on the Hadoop cluster.
  4. queue (string, defaults to ""): The queue to submit the job to, if the capacity scheduler is enabled on the cluster.
  5. conf (table, optional): A table of key-value pairs that will be added to the Configuration instance that is passed to the launched containers via the job.xml file. The creation of job.xml is built-in to the Kitten framework and is similar to how the MapReduce library uses the Configuration object to pass client-side configuration information to tasks executing on the cluster.
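
For example, a conf table for the distributed shell application might look like this (the key and value are illustrative):

distshell = yarn {
  name = "Distributed Shell",
  conf = {
    ["my.app.greeting"] = "Hello World!",
  },
  -- ...
}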

In order to configure the application master and the container tasks, the yarn function checks for the presence of a master field and either a container or containers field. The container field is a shortcut for the case in which there is only one kind of container configuration; otherwise, the containers field expects a list of container configurations. The master and container/containers fields take a similar set of fields that specify how to allocate resources and then run a command in the container that was created (a combined sketch follows the list):

  1. env (table, optional): A table of key-value pairs that will be set as environment variables in the container. Note that if all of the environment variables are the same for the master and container, you can specify the env table once in the yarn table and it will be linked to the subtables by the yarn function.
  2. memory (integer, defaults to 512): The amount of memory to allocate for the container, in megabytes. If the same amount of memory is allocated for both the master and the containers, you can specify the value once inside of the yarn table and it will be linked to the subtables by the yarn function.
  3. cores (integer, defaults to 1): The number of virtual cores to allocate for the container. If the same number of cores are allocated for both the master and the containers, you can specify the value once inside of the yarn table and it will be linked to the subtables by the yarn function.
  4. instances (integer, defaults to 1): The number of instances of this container type to create on the cluster. Note that this only applies to the container/containers arguments; the system will only allocate a single master for each application.
  5. priority (integer, defaults to 0): The relative priority of the containers that are allocated. Note that this prioritization is internal to each application; it does not control how many resources the application is allowed to use or how they are prioritized.
  6. tolerated_failures (integer, defaults to 4): This field is only specified on the application master, and it specifies how many container failures should be tolerated before the application shuts down.
  7. command/commands (string(s) or table(s), optional): command is a shortcut for commands in the case that there is only a single command that needs to be executed within each container. This field can either be a string that will be run as-is, or it may be a table that contains two subfields: a base field that is a string and an args field that is a table. Kitten will construct a command by concatenating the values in the args table to the base string to form the command to execute.
  8. resources (table of tables, optional): The resources (in terms of files, URLs, etc.) that the command needs to run in the container. An outline of the resources fields is given in the following section.
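
To illustrate several of these fields together, here is a sketch (all names and values are illustrative) that uses the plural containers form and both styles of the command field:

workers = yarn {
  name = "Two Container Types",
  memory = 256,  -- linked to the master and the containers by the yarn function

  master = {
    tolerated_failures = 2,
    command = {
      base = "java -Xmx128m com.cloudera.kitten.appmaster.ApplicationMaster",
      args = { "-conf job.xml" },
    }
  },

  containers = {
    {
      instances = 2,
      priority = 0,
      command = "echo 'worker' >> /tmp/worker",
    },
    {
      instances = 1,
      cores = 2,
      priority = 1,
      command = { base = "echo 'coordinator'", args = { ">> /tmp/coordinator" } },
    },
  }
}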

YARN has a mechanism for copying files that are needed by an application to a working directory created for the container that the application will run in. These files are referred to in Kitten as resources. A resource specification might look like this:

master = {
  ...
  resources = {
    ["app.jar"] = {
      file = "/localfs/path/to/local/jar/long-name-for-app.jar",
      type = "file",               -- other value: 'archive'
      visibility = "application",  -- other values: 'private', 'public'
    },

    { hdfs = "/hdfs/path/to/cluster/file/example.txt" },
  }
}

In this specification, we are referencing a jar file that is stored on the local filesystem, and by setting the key to "app.jar", we indicate that the file should be named "app.jar" when it is copied into the container's working directory. Kitten handles the process of copying the local file to HDFS before the job begins so that the file is available to the application. The other resource we are referencing is already stored on HDFS, which is indicated by the use of the hdfs field instead of the file field in the table. Since we did not specify a name for this resource via a key in the table, Kitten uses the file's name (example.txt) as the name for the resource in the working directory.

The cat Function

The cat function takes a table base as its only argument and returns a function that copies the key-value pairs of base into any table passed to it, unless that table already defines a value for a key that occurs in base. An example will help clarify the usage.

base = cat { x = 1, y = 2, z = 3 }  -- 'base' is a reference to a function, not a table.
derived = base { x = 17, y = 29 }   -- 'derived' also has the key-value pair 'z=3' defined in it.
base_as_table = base {}             -- 'base_as_table' is a table, not a function.

It is common to have some arguments or environment variables in a YARN application that need to be shared between the application master and the tasks that are launched in containers. cat provides a convenient way to reference this shared information within the configuration file while allowing some extra parameters to be specified:

-- Define the common environment variables once.
base_env = cat { CLASSPATH = "...", JAVA_HOME = "..." }

my_app = yarn {
  master = {
    -- Copy base_env and add an extra setting for the master.
    env = base_env { IS_MASTER = 1 },
  },

  container = {
    -- Copy base_env and add an extra variable for the container nodes.
    env = base_env { IS_MASTER = 0 },
  }
}

Kitten Services

Kitten provides a pair of services that handle all interactions with YARN's ResourceManager: one for the client (YarnClientService) and one for the application master (ApplicationMasterService). These services are implemented via the Service API defined in Google's Guava library for Java, and they manage the lifecycle of starting a new application, monitoring it while it runs, and then shutting it down and performing any necessary cleanup when the application has finished. Additionally, they provide auxiliary functions for handling common tasks during application execution.

The client and master service APIs have a similar design. They both rely on an interface that specifies how to configure various requests that are issued to YARN's ResourceManager. In the case of the client, this is the YarnClientParameters interface, and for the master, it is the ApplicationMasterParameters interface. Kitten ships with implementations of these interfaces that get their values from a combination of the Kitten Lua DSL and an optional map of key-value pairs that are specified in Java and may be used to provide configuration information that is not known until runtime. For example, this map is used to communicate the hostname and port of the master application to the slave nodes that are launched via containers.

Client Services

Kitten ships with a default client application, KittenClient, which is intended to be used with YARN applications that provide their status via a tracking URL and do not require application-specific client interactions. KittenClient has two required arguments: the first should be a path to a .lua file that contains the configuration information for a YARN application, and the second should be the name of one of the tables defined in that file using the yarn function. Additionally, KittenClient implements Hadoop's Tool interface, and so those two required arguments may be preceded by the other Hadoop configuration arguments.
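
For example, using the distributed shell configuration from above (my-site.xml is illustrative; -conf is one of Hadoop's standard generic options):

hadoop jar kitten-client-0.2.0-jar-with-dependencies.jar -conf my-site.xml distshell.lua distshell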

KittenClient can also serve as the basis for your own YARN clients: subclass it and override the handle method to interact with the running service.

ApplicationMaster Services

Kitten also ships with a default application master, the aptly-named ApplicationMaster, which is primarily intended as an example.

Most real YARN applications will also incorporate some logic for coordinating the slave nodes into their application master binary, as well as for passing additional configuration information to the Lua DSL. The ApplicationMasterParameters interface provides methods that allow the application master to specify the hostname (setHostname), port (setClientPort), and a tracking URL (setTrackingUrl) that will be passed along to the ResourceManager and then to the client.

FAQ

  1. Why Lua as a configuration language?

    Lua's original use case was as a tool for configuring C++ applications, and it has been widely adopted in the gaming community as a simple scripting language. It has a number of desirable properties for the use case of configuring YARN applications, namely:

    1. It integrates well with both Java and C++. We expect to see YARN applications written in both languages, and expect that Kitten will need to support both. Having a single configuration format for both languages reduces the cognitive overhead for developers.
    2. It is a programming language, but not much of one. Lua provides a complete programming environment when you need it, but mainly stays out of your way and lets you focus on configuration.
    3. It tolerates missing values well. It is easy to reference values in a configuration file that may not be defined until much later. For example, we can specify parameters that will eventually contain the value of the master's hostname and port, but are undefined when the client application is initially configured.

    That said, we fully expect that other languages (e.g., Lisp) would make excellent configuration languages for YARN applications, which is why the YarnClientParameters and ApplicationMasterParameters are interfaces: we can swap out other configuration DSLs that may make more sense for certain developers or use cases.

  2. What are your plans for Kitten?

    That's a good question. In the short term, we're primarily interested in writing YARN applications that leverage Kitten, which will give us a chance to fix bugs and add solutions to common design patterns. We expect that Kitten will gain functionality over time that makes it easier to handle failures, report internal application state, and dynamically allocate new resources while an application runs. We also expect to spend a fair amount of time adding C++ versions of the client and application master services so that Kitten's DSL can also be used to configure C++ applications that run on YARN.

    Additionally, we could add functionality for configuring all kinds of jobs (MapReduce jobs, Pig scripts, Hive queries, etc.) as Kitten tasks, using functions similar to yarn. We could also use Kitten to specify DAGs of tasks and treat Kitten as an alternative way of interacting with Oozie's job scheduling functionality.

  3. Were any animals harmed in the development of this library?

    No.

  4. What if I have questions that aren't answered by this (otherwise awesome) FAQ?

    We have mailing lists where you can ask questions, either as a user of Kitten or as a developer of Kitten.
