Repository Details

Cluster Autoscaler for Kubernetes and Mesos

Clusterman - Autoscale and Manage your compute clusters

Clusterman (the Cluster Manager) is an autoscaling engine for Mesos and Kubernetes clusters. It looks at metrics and can launch or terminate compute to meet the needs of your workloads, similar to the official Kubernetes Cluster Autoscaler. It provides the following features:

  • Customizable metrics: All metrics for Clusterman are stored in an external datastore, and are automatically loaded into the signals that need them
  • Pluggable autoscaling signals: Your team knows how the application you're running should scale in response to metrics, so your team should own the signal that tells Clusterman what to do
  • Full-featured simulation environment: Want to know how the autoscaler is going to respond to production traffic before you deploy changes? The Clusterman simulation environment lets you do this. You can also simulate future traffic so that you can predict usage or cost increases before they happen.

For more information, see the Clusterman documentation.

Getting Started

You can try out Clusterman in a local development environment against a Dockerized Mesos cluster by running the following commands:

make example
clusterman status --cluster local-dev -v

All of the Clusterman CLI commands should work in the above environment. You can see examples of the Clusterman services by running:

make itest-external

Components

Clusterman is made up of the following components:

  • Metrics Data Store: All relevant data used by scaling signals is written to a single data store for a single source of truth about historical cluster state. At Yelp, we use AWS DynamoDB for this datastore. Metrics are written to the datastore via a separate metrics library.
  • Pluggable Signals: Metrics (from the data store) are consumed by signals (small bits of code that are used to produce resource requests). Signals run in separate processes configured by supervisord, and use Unix sockets to communicate.
  • Core Autoscaler: The autoscaler logic consumes resource requests from the signals and combines them to determine how many resources to request from or release back to the cloud provider.
  • Resource Groups and Pools: Each autoscaler instance manages exactly one "pool", that is, a logical grouping of machines in a cluster. Pools consist of "resource groups", such as a Spot Fleet Request (SFR) or AutoScaling Group (ASG) from AWS EC2.
  • Configuration: Clusterman stores global configuration values in a file called clusterman.yaml, and per-pool configuration in clusterman-clusters/<cluster-name>/<pool-name>.(mesos|kubernetes). These config files tell the Clusterman services when and how to run, and they serve as the glue to hook up an autoscaler with its signals. Configure the path to clusterman.yaml with the --env-config-path flag, and the path to clusterman-clusters with --cluster-config-directory.
  • An Autoscaling Simulation Environment: Clusterman comes with a complete simulation environment for running tests with your signals on historical data before they are deployed. This environment can produce information about the cost of your cluster, as well as whether it is over- or under-provisioned.
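The flow from metrics to signals to the core autoscaler can be illustrated with a small sketch. Everything named here (`cpu_signal`, `combine_requests`, `capacity_delta`, the `cpus_allocated` metric, and the 25% headroom) is hypothetical and chosen for illustration; it is not Clusterman's actual API, and summing requests is only one possible combining rule.

```python
from typing import Callable, Dict, List

# Hypothetical types for illustration; these are not Clusterman's real classes.
Metrics = Dict[str, List[float]]
ResourceRequest = Dict[str, float]
Signal = Callable[[Metrics], ResourceRequest]


def cpu_signal(metrics: Metrics) -> ResourceRequest:
    """Request enough CPUs to cover the latest usage plus 25% headroom."""
    latest = metrics["cpus_allocated"][-1]
    return {"cpus": latest * 1.25}


def combine_requests(requests: List[ResourceRequest]) -> ResourceRequest:
    """Sum the per-resource requests from all signals (an assumed rule)."""
    total: ResourceRequest = {}
    for request in requests:
        for resource, amount in request.items():
            total[resource] = total.get(resource, 0.0) + amount
    return total


def capacity_delta(requested: ResourceRequest, current: ResourceRequest) -> ResourceRequest:
    """Positive values mean scale up; negative values mean scale down."""
    resources = set(requested) | set(current)
    return {r: requested.get(r, 0.0) - current.get(r, 0.0) for r in resources}
```

In this sketch, a signal owned by an application team only has to return a resource request; deciding how much capacity to add or release stays in the shared autoscaler logic.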

Clusterman has two main ways of interacting with the clusters it manages. The Clusterman CLI provides a set of command-line tools for viewing and managing the state of the cluster; type clusterman --help to see a list of possible subcommands. See the Clusterman documentation for more details.

The Clusterman service runs as a set of three long-running processes. The first collects data about spot instance pricing from AWS (not required if you aren't using AWS, spot instances, or the Clusterman simulator). The second queries each of the pools in a cluster to collect metadata and system metrics about the pool. The third is responsible for actually autoscaling each of the pools.

Integrating Clusterman

At Yelp, we use PaaSTA, our platform-as-a-service, to manage Clusterman. If you use PaaSTA, setting up Clusterman should be relatively straightforward. Otherwise you will need to provide additional tooling to deploy the Clusterman code or Docker image to your environment.

If you'd like to use Clusterman in your environment, you will need the following components set up:

  • A metrics datastore with the appropriate tables. See examples/terraform/clusterman.tf for a Terraform representation of the schema in DynamoDB.
  • A clusterman_metrics library that can communicate with your chosen metrics datastore. There is a reference copy of the metrics library in examples/clusterman_metrics that is capable of communicating with AWS DynamoDB.
  • Code to run the autoscaler service. At Yelp, we use an internal batch library called yelp_batch for this task; however, the same goal can be achieved by simply running the code in a never-terminating while loop. See the sample code in examples/batch for a place to start.
  • Configuration files. Clusterman uses one "master" configuration file as well as a configuration file per pool that it autoscales. You can see examples of these config files in acceptance/srv-configs, and the config file schema in examples/schemas.
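For the "never-terminating while loop" option mentioned above, a minimal sketch might look like the following. The `run_forever` helper, its parameters, and the `run_once` callable are assumptions made for illustration; the real entry points live in examples/batch and may look quite different.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def run_forever(
    run_once: Callable[[], None],
    interval_seconds: float,
    max_iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run_once repeatedly, sleeping between iterations.

    max_iterations and sleep exist only so the loop can be exercised in
    tests; leave max_iterations as None to run until the process is stopped.
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        try:
            run_once()
        except Exception:
            # Log and continue so that one failed run does not stop autoscaling.
            logger.exception("autoscaler iteration failed")
        iterations += 1
        sleep(interval_seconds)
    return iterations
```

Here `run_once` stands in for whatever function performs a single autoscaling pass in your deployment. Catching exceptions inside the loop is a design choice: it keeps the service alive through transient errors, at the cost of needing log-based alerting to notice repeated failures.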

To build a Debian package for the Clusterman CLI, run make package. To build an example Docker image which can run the Clusterman batch code, run make cook-image-external.

Clusterman uses EC2 tags in order to find the resource groups that it manages. To configure a resource group so that Clusterman can find it, you need to add a tag like the following to your ASG or SFR:

tag-name: "{\"paasta_cluster\": \"cluster-name\", \"pool\": \"pool-name\"}"

You can specify the value of tag-name in your configuration file for the pool:

resource_groups:
  - (sfr|asg):
      tag: tag-name
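The tag value shown above is a JSON-encoded string, so it can be generated rather than escaped by hand. The sketch below uses only the standard library; the helper names `resource_group_tag_value` and `matches_pool` are hypothetical and are not part of Clusterman.

```python
import json


def resource_group_tag_value(cluster: str, pool: str) -> str:
    """Return the JSON string to store as the EC2 tag value."""
    return json.dumps({"paasta_cluster": cluster, "pool": pool})


def matches_pool(tag_value: str, cluster: str, pool: str) -> bool:
    """Check whether a tag value identifies the given cluster and pool."""
    try:
        data = json.loads(tag_value)
    except ValueError:
        # Not valid JSON, so it cannot be a resource group tag in this format.
        return False
    return (
        isinstance(data, dict)
        and data.get("paasta_cluster") == cluster
        and data.get("pool") == pool
    )
```

For example, `resource_group_tag_value("cluster-name", "pool-name")` produces the same JSON text as the tag value above, before any extra quoting that your tooling requires.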

Design Goals

Clusterman is designed to support a wide range of cluster autoscaling needs at Yelp. We run many different types of workloads (long-running services, batch jobs, machine learning tasks, databases, etc.) on top of Kubernetes and Mesos, and each of these workloads has a different set of scaling requirements. Clusterman is designed to be a unified system that can accommodate each of these workloads. To that end, Clusterman's design goals are:

  • A modular design that separates cloud API calls from signal evaluation and the core autoscaling loop
  • Unified autoscaling logic for a multi-tenant cluster
  • Client-owned scaling signals for requesting resources
  • A command-line interface for managing and interacting with clusters
  • A simulation environment for performing cost and behaviour analysis

Licence

Clusterman is licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

Contributing

Everyone is encouraged to contribute to Clusterman by forking the GitHub repository and making a pull request or opening an issue. Please read our Code of Conduct.

Instructions for Yelp developers

  1. Make your changes, push a branch to GitHub, and create a pull request
  2. Once your PR is approved, merge your changes to master

A Jenkins pipeline polls GitHub and brings any changes into our internal version. Jenkins will then build and deploy Clusterman as normal.
