Cloudera Playbook

An Ansible Playbook that installs the Cloudera stack on RHEL/CentOS

Running the playbook

Please do not use Ansible 2.9.0. This version has an issue with templating which causes the playbook execution to fail. Instead, use any 2.8.x version or a later 2.9.x version as these are not affected.
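
To check which Ansible version is installed on the control host before running the playbook:

$ ansible --version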

  • Create Ansible configuration (optional):
$ vi ~/.ansible.cfg

[defaults]
# disable key check if host is not initially in 'known_hosts'
host_key_checking = False

[ssh_connection]
# if True, make ansible use scp if the connection type is ssh (default is sftp)
scp_if_ssh = True
$ vi ~/ansible_hosts

[scm_server]
host1.example.com

[db_server]
host2.example.com

[krb5_server]
host3.example.com

[utility_servers:children]
scm_server
db_server
krb5_server

[edge_servers]
host4.example.com        host_template=HostTemplate-Edge role_ref_names=HDFS-HTTPFS-1

[master_servers]
host5.example.com        host_template=HostTemplate-Master1
host6.example.com        host_template=HostTemplate-Master2
host7.example.com        host_template=HostTemplate-Master3

[worker_servers]
host8.example.com
host9.example.com
host10.example.com

[worker_servers:vars]
host_template=HostTemplate-Workers

[cdh_servers:children]
utility_servers
edge_servers
master_servers
worker_servers

Important: fully qualified domain name (FQDN) is mandatory in the ansible_hosts file

  • Run playbook
$ ansible-playbook -i ~/ansible_hosts cloudera-playbook/site.yml
    
-i INVENTORY
   inventory host path or comma separated host list (default=/etc/ansible/hosts)

Ansible communicates with the hosts defined in the inventory over SSH. It assumes you’re using SSH keys to authenticate so your public SSH key should exist in authorized_keys on those hosts. Your user will need sudo privileges to install the required packages.

By default Ansible will connect to the remote hosts using the current user (as SSH would). To override the remote user name you can specify the --user option in the command, or add the following variables to the inventory:

[all:vars]
ansible_user=ec2-user
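
For example, a run that overrides the remote user on the command line instead (using the same ec2-user account as above):

$ ansible-playbook -i ~/ansible_hosts --user ec2-user cloudera-playbook/site.yml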

AWS users can use Ansible’s --private-key option to authenticate using a PEM file instead of SSH keys.
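
A sketch of such a run (the PEM file path here is only a placeholder):

$ ansible-playbook -i ~/ansible_hosts --user ec2-user --private-key ~/my-aws-key.pem cloudera-playbook/site.yml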

Enabling Kerberos

The playbook can install a local MIT KDC and configure Hadoop Security. To enable Hadoop Security:

  • Specify the '[krb5_server]' host in the inventory (see above)
  • Set 'krb5_kdc_type' to 'MIT KDC' in group_vars/krb5_server.yml (see the example below)
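
A minimal sketch of that setting in group_vars/krb5_server.yml (any other variables in the file are left unchanged):

krb5_kdc_type: MIT KDC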

Overriding CDH service/role configuration

The playbook uses Cloudera Manager Templates to provision a cluster. As part of the template import process Cloudera Manager applies Autoconfiguration rules that set properties such as memory and CPU allocations for various roles.

If the cluster has different hardware or operational requirements then you can override these properties in group_vars/cdh_servers. For example:

cdh_services:
  - type: hdfs        
    datanode_java_heapsize: 10737418240

These properties are added as variables to the rendered template's instantiator block and can be referenced from the service configs. For example, in roles/cdh/templates/hdfs.j2:

"roleType": "DATANODE",
"configs": [{
  "name": "datanode_java_heapsize",
  "variable": "DATANODE_JAVA_HEAPSIZE"
}]

Dynamic Inventory Script for Cloudera Manager

To make integration easier, Gabor Roczei created a dynamic inventory script that allows Ansible to gather data from Cloudera Manager. Its main advantages are:

  • Cache management of inventory for better performance
  • Cloudera Manager’s HTTP cookie handling
  • Support for multiple Cloudera Manager instances
  • SSL-friendly, as the root CA check of the Cloudera Manager server can be disabled or enabled

[Figure: High-level architecture of the Ansible dynamic inventory vs. Cloudera Managers]

Configuration

Step 1: Configuration of the related Cloudera Manager(s)

$ export CM_URL=https://cm1.example.com:7183,https://cm2.example.com:7183
$ export CM_USERNAME=username

Other optional configuration parameters:

$ export CM_CACHE_TIME_SEC=3600      # inventory cache lifetime, in seconds
$ export CM_DISABLE_CA_CHECK=True    # disable the root CA check of the Cloudera Manager server
$ export CM_TIMEOUT_SEC=60           # timeout for requests to Cloudera Manager, in seconds
$ export CM_DEBUG=False              # enable verbose debug output

Note: We recommend adding these environment variables to the startup file of your shell. For example: $HOME/.bashrc

Step 2: Installation of the git package:

# yum install git

Step 3: Installation of the Ansible package:

# yum install ansible

Step 4: Clone the cloudera-playbook git repository:

$ git clone https://github.com/cloudera/cloudera-playbook

Note: The cloudera-playbook git repository is not officially supported by Cloudera, but the authors recommend using it.

Step 5: Set up the default Ansible inventory and other useful Ansible parameters:

$ vi $HOME/.ansible.cfg
[defaults]
# Python 2 version:
inventory = $HOME/cloudera-playbook/dynamic_inventory_cm_py2
# Python 3 version:
# inventory = $HOME/cloudera-playbook/dynamic_inventory_cm_py3
# Do not gather the host information (facts) by default. This can give significant speedups for large clusters.
# gathering = explicit
# Disable key check if host is not initially in 'known_hosts'
host_key_checking = False
[ssh_connection]
# If it is True, make ansible use scp if the connection type is ssh (default is sftp)
scp_if_ssh = True

Note: Update the inventory path in $HOME/.ansible.cfg if dynamic_inventory_cm_py2 is located elsewhere

Step 6: Change the working directory to cloudera-playbook

$ cd cloudera-playbook

Step 7: The available Cloudera Manager clusters (Ansible groups, such as Cluster_1, Balaton) can be listed with the following command:

$ ./dynamic_inventory_cm_py2 --list

Note: The cache of the Cloudera Manager inventory can be refreshed with the following command:

$ ./dynamic_inventory_cm_py2 --refresh-cache

Step 8: Set up SSH public key authentication for the remote hosts:

The main advantage of this is that ad-hoc commands no longer prompt for your password on every run; you enter the private key passphrase only once, when the key is added to the agent.

If the ~/.ssh/id_rsa.pub and ~/.ssh/id_rsa files do not exist, they need to be generated with the ssh-keygen command prior to connecting to the managed hosts.
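
If you need to generate them, a minimal sketch that accepts the default locations (producing ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub):

$ ssh-keygen -t rsa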

Launch a subshell with the following command:

$ ssh-agent bash

You will execute the rest of the commands in this How-To article in the subshell.

Add the SSH private key into the SSH authentication agent:

$ ssh-add ~/.ssh/id_rsa

Validate:

$ ssh-add -L

Upload the SSH public key (id_rsa.pub) to the managed hosts:

$ ansible all -m authorized_key -a key="{{ lookup('file', '~/.ssh/id_rsa.pub') }} user=$USER" --ask-pass -u $USER --become-user $USER

For example, you can use the root user:

$ ansible all -m authorized_key -a key="{{ lookup('file', '~/.ssh/id_rsa.pub') }} user=root" --ask-pass -u root

Note: If you do not want to use SSH public key authentication, add the --ask-pass parameter each time you run the Ansible command.
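
For example, a connectivity check that relies on password authentication only (no keys uploaded):

$ ansible all -m ping -u root --ask-pass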

Step 9: Test remote host connectivity (optional):

$ ansible all -m ping -u $USER --become-user $USER

For example, you can execute the command with the root user:

$ ansible all -m ping -u root

Step 10: The ad-hoc command feature lets you run arbitrary Linux commands on all hosts. You can use it, for example, to troubleshoot slow group resolution issues.

The following commands are example ad-hoc commands where Balaton is a group of hosts that is a cluster in Cloudera Manager:

$ ansible Balaton -u $USER --become-user $USER -m command -o -a "time id -Gn $USER" 
$ ansible all -u $USER --become-user $USER -m command -o -a "date"

The following example uses the root user:

$ ansible Balaton -m command -o -a "time id -Gn testuser" -u root
$ ansible all -m command -o -a "date" -u root

Further information about dynamic inventory and ad-hoc commands can be found in the Ansible documentation:

SSSD setup with Ansible (applicable for RHEL 7 / CentOS 7)

Cloudera blog articles:

Step 1: Edit the default variables in group_vars/all:

krb5_realm: AD.SEC.EXAMPLE.COM
krb5_kdc_type: Active Directory
krb5_kdc_host: w2k8-1.ad.sec.example.com
ad_domain: "{{ krb5_realm.lower() }}"
computer_ou: ou=computer_hosts,ou=hadoop_prd,dc=ad,dc=sec,dc=example,dc=com
ldap_group_search_base: OU=groups,OU=hadoop_prd,DC=ad,DC=sec,DC=example,DC=com
ldap_user_search_base: DC=ad,DC=sec,DC=example,DC=com?subtree?(memberOf=CN=hadoop_users,OU=groups,OU=hadoop_prd,DC=ad,DC=sec,DC=example,DC=com)
override_gid: 999999
ad_site: Default-First-Site-Name

Step 2: Enable kerberos on the hosts:

If necessary, update this template file (See the Ansible Templating (Jinja2) documentation for more information):

templates/krb5.conf.j2

Run this command to apply it on the managed hosts:

$ ansible-playbook --tags krb5_client -u root site.yml

Step 3: Join the host(s) to realm:

If necessary, update these template files (See the Ansible Templating (Jinja2) documentation for more information):

roles/realm/join/templates/sssd.conf.j2
roles/realm/join/templates/realmd.conf.j2
roles/realm/join/templates/nscd.conf.j2

Run this command to apply it on all managed hosts:

$ ansible-playbook -u root realm_join.yaml
bind user: administrator
bind password:

Run this command to apply it on a cluster (for example: Balaton) (See the Ansible Best Practices documentation for more information):

$ ansible-playbook --limit Balaton -u root realm_join.yaml
bind user: administrator
bind password:

Remove all hosts from the realm with this command:

$ ansible-playbook -u root realm_leave.yaml

Remove the Balaton hosts from the realm with this command (See the Ansible Best Practices documentation for more information):

$ ansible-playbook --limit Balaton -u root realm_leave.yaml

How do I contribute code?

You need to first sign and return an ICLA and CCLA before we can accept and redistribute your contribution. Once these are submitted you are free to start contributing to cloudera-playbook. Submit these to [email protected].

Main steps

  • Fork the repo and create a topic branch
  • Push commits to your repo
  • Create a pull request!
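
A sketch of that workflow on the command line, where <your-username> is your GitHub account and my-topic-branch is a placeholder branch name:

$ git clone https://github.com/<your-username>/cloudera-playbook
$ cd cloudera-playbook
$ git checkout -b my-topic-branch
# ... make and commit your changes ...
$ git push origin my-topic-branch

Then open a pull request from my-topic-branch against the upstream repository.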

Find

We use GitHub issues to track bugs for this project. Find an issue that you would like to work on (or file one if you have discovered a new issue!). If no one is working on it, assign it to yourself only if you intend to work on it shortly.

Fix

Please write a good, clear commit message, with a short, descriptive title and a message that is exactly long enough to explain what the problem was, and how it was fixed.

License

Apache License, Version 2.0
