terraform-gke-kubeflow-cluster
A Terraform module for creating a GKE cluster to run Kubeflow on.
This module creates a GKE cluster similar to the one the kfctl tool creates, with a few changes:
- adds a Cloud SQL instance to use for the metadata store/databases
- creates a GCE Persistent Disk to use for the artifact store
This module was originally created by the ML Infrastructure team at Spotify to
create and manage long-lived GKE clusters for the many Kubeflow-using teams at
Spotify, whereas the kfctl tool and the documentation around creating a
cluster for Kubeflow tend to assume that individual clusters are quickly
spun up and torn down by engineers using Kubeflow. For more details on
Spotify's centralized Kubeflow platform, see this talk from KubeCon North
America 2019.
Usage
To use this within Terraform, add a module block like:

```hcl
module "kubeflow-cluster" {
  source  = "spotify/kubeflow-cluster/gke"
  version = "0.0.1"
}
```
For more details, see https://registry.terraform.io/modules/spotify/kubeflow-cluster/gke/0.0.1
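A real configuration will also need input variables. The names below are illustrative placeholders only, not necessarily the module's actual inputs; consult the Registry page above for the real variable list:

```hcl
module "kubeflow-cluster" {
  source  = "spotify/kubeflow-cluster/gke"
  version = "0.0.1"

  # Illustrative inputs only -- check the module's Registry page
  # for the actual variable names and required values.
  project      = "my-gcp-project"
  cluster_name = "kubeflow-prod"
  region       = "europe-west1"
}
```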
Module details
The terraform-gke-kubeflow-cluster module creates the following resources:
- a GKE cluster (attached to a Shared VPC if the relevant parameters for networks/subnetworks are set)
- a Cloud SQL instance to use for the metadata store/databases
- a GCE Persistent Disk to use for Argo's artifact store
- GCP service accounts for Kubeflow to use (distinct accounts per cluster):
  - an "admin" service account (used for IAP, which is not included in this module)
  - the "user" service account for Kubeflow pipelines to use
  - the VM service account used by the GKE cluster/nodes themselves
- IAM bindings for the above service accounts
- Kubernetes secrets for:
  - `cloudsql-instance-credentials` for the cloudsql-proxy connected to the metadata SQL instance
  - `admin-gcp-sa` containing the "admin" GCP service account for Kubeflow
  - `user-gcp-sa` containing the "user" GCP service account for Kubeflow
Each instantiation of the module creates a new set of all of these resources;
the intent of the module is to automate the entire setup of the GCP resources needed to run a Kubeflow cluster.
This repo does not currently install the Kubeflow system components on the cluster; use kfctl or another tool for that.
Local development
Run the following commands from the root of the project:
```shell
brew install tfenv    # install tfenv
tfenv install         # install the Terraform version specified in .terraform-version in source control
terraform init        # set up Terraform providers
```
Note on master and node version values
The expected behavior of fuzzy versions for `min_master_version` and
`node_version` is undocumented (see the GitHub issue). From empirical
evidence, the behavior so far is that the most recent version matching the
fuzzy version is used. For example, `node_version = "1.11"` results in GKE
nodes running 1.11.7-gke.6 if that is the most recent matching version.
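The fuzzy-version behavior can be sketched on a bare cluster resource; `min_master_version` and `node_version` are real `google_container_cluster` arguments, while the name and location below are illustrative:

```hcl
# Sketch: fuzzy version pinning on a GKE cluster. With a fuzzy value
# like "1.11", GKE resolves the most recent matching release
# (e.g. 1.11.7-gke.6 at the time of writing).
resource "google_container_cluster" "kubeflow" {
  name               = "kubeflow-cluster"   # illustrative name
  location           = "us-central1"        # illustrative location
  initial_node_count = 1

  min_master_version = "1.11"
  node_version       = "1.11"
}
```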
Releasing new versions of the module
See https://www.terraform.io/docs/registry/modules/publish.html#releasing-new-versions
A webhook has been automatically added to the repo, and a new release will be
visible in the Terraform Registry whenever a new tag that looks like a
semantic version (e.g. "v1.2.3") is pushed. To cut a release, simply tag a
commit and push the tag to GitHub with `git push --tags`.
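The release step above can be sketched as follows; the tag name is an example, and this sketch runs in a throwaway repo so it can be tried safely (in the real module repo you would just run the `git tag` and `git push --tags` commands against an existing commit):

```shell
# Cut a release by tagging a commit with a semantic version.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.email=dev@example.com -c user.name=dev \
  commit -q --allow-empty -m "release"
git tag v1.2.3            # tag must look like a semantic version
git tag --list 'v*'       # the Registry webhook picks up tags like this
# In the real repo, publish the release with: git push --tags
```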
Code of Conduct
This project adheres to the Open Code of Conduct. By participating, you are expected to honor this code.