TensorFlow Cloud

The TensorFlow Cloud repository provides APIs that let you move easily from debugging, training, and tuning your Keras and TensorFlow code in a local environment to distributed training and tuning on Google Cloud.

Introduction

TensorFlow Cloud run API for GCP training/tuning

Installation

Requirements

For detailed end to end setup instructions, please see Setup instructions.

Install latest release

pip install -U tensorflow-cloud

Install from source

git clone https://github.com/tensorflow/cloud.git
cd cloud
pip install src/python/.

High level overview

The TensorFlow Cloud package provides the run API for training your models on GCP. To start, let's walk through a simple workflow using this API.

  1. Let's begin with Keras model training code such as the following, saved as mnist_example.py.

    import tensorflow as tf
    
    (x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
    
    x_train = x_train.reshape((60000, 28 * 28))
    x_train = x_train.astype('float32') / 255
    
    model = tf.keras.Sequential([
      tf.keras.layers.Dense(512, activation='relu', input_shape=(28 * 28,)),
      tf.keras.layers.Dropout(0.2),
      tf.keras.layers.Dense(10, activation='softmax')
    ])
    
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])
    
    model.fit(x_train, y_train, epochs=10, batch_size=128)
  2. After you have tested this model in your local environment for a few epochs, probably with a small dataset, you can train the model on Google Cloud by writing the following simple script, scale_mnist.py.

    import tensorflow_cloud as tfc
    tfc.run(entry_point='mnist_example.py')

    Running scale_mnist.py automatically applies TensorFlow's OneDeviceStrategy and trains your model at scale on Google Cloud Platform. Please see the usage guide section for detailed instructions and additional API parameters.

  3. You will see an output similar to the following on your console. This information can be used to track the training job status.

    user@desktop$ python scale_mnist.py
    Job submitted successfully.
    Your job ID is:  tf_cloud_train_519ec89c_a876_49a9_b578_4fe300f8865e
    Please access your job logs at the following URL:
    https://console.cloud.google.com/mlengine/jobs/tf_cloud_train_519ec89c_a876_49a9_b578_4fe300f8865e?project=prod-123
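
If you want to track the job from a script rather than by reading the console, the job ID can be pulled out of the captured output. The following is a minimal sketch that assumes the output format shown in step 3; the helper names extract_job_id and job_logs_url are hypothetical and are not part of tensorflow-cloud, and the output format may differ between releases.

```python
import re

# Pattern based on the sample console output above (an assumption, not a
# documented contract of tensorflow-cloud).
JOB_ID_PATTERN = re.compile(r"Your job ID is:\s*(\S+)")


def extract_job_id(console_output):
    """Return the job ID found in the console output, or None if absent."""
    match = JOB_ID_PATTERN.search(console_output)
    return match.group(1) if match else None


def job_logs_url(job_id, project_id):
    """Build a logs URL with the same shape as the one in the sample output."""
    return (
        "https://console.cloud.google.com/mlengine/jobs/"
        f"{job_id}?project={project_id}"
    )


sample_output = """Job submitted successfully.
Your job ID is:  tf_cloud_train_519ec89c_a876_49a9_b578_4fe300f8865e
Please access your job logs at the following URL:
https://console.cloud.google.com/mlengine/jobs/tf_cloud_train_519ec89c_a876_49a9_b578_4fe300f8865e?project=prod-123
"""

job_id = extract_job_id(sample_output)
print(job_id)
print(job_logs_url(job_id, "prod-123"))
```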

Setup instructions

End-to-end instructions to help set up your environment for TensorFlow Cloud. You can use one of the following notebooks to set up your project, or follow the instructions below.

Run in Colab | View on GitHub | Run in Kaggle
  1. Create a new local directory

    mkdir tensorflow_cloud
    cd tensorflow_cloud
  2. Make sure you have python >= 3.6

    python -V
  3. Set up virtual environment

    virtualenv tfcloud --python=python3
    source tfcloud/bin/activate
  4. Set up your Google Cloud project

    Verify that gcloud sdk is installed.

    which gcloud

    Set default gcloud project

    export PROJECT_ID=<your-project-id>
    gcloud config set project $PROJECT_ID
  5. Authenticate your GCP account

    Create a service account.

    export SA_NAME=<your-sa-name>
    gcloud iam service-accounts create $SA_NAME
    gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com \
        --role 'roles/editor'

    Create a key for your service account.

    gcloud iam service-accounts keys create ~/key.json --iam-account $SA_NAME@$PROJECT_ID.iam.gserviceaccount.com

    Create the GOOGLE_APPLICATION_CREDENTIALS environment variable.

    export GOOGLE_APPLICATION_CREDENTIALS=~/key.json
  6. Create a Cloud Storage bucket. Google Cloud Build is the recommended method for building and publishing Docker images, although you can optionally use a local Docker daemon process, depending on your specific needs.

    BUCKET_NAME="your-bucket-name"
    REGION="us-central1"
    gcloud auth login
    gsutil mb -l $REGION gs://$BUCKET_NAME

    (Optional, for local Docker setup)

    sudo dockerd

  7. Authenticate access to Google Cloud registry.

    gcloud auth configure-docker
  8. Install nbconvert if you plan to use a notebook file entry_point as shown in usage guide #4.

    pip install nbconvert
  9. Install latest release of tensorflow-cloud

    pip install tensorflow-cloud
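
Before submitting a job, it can help to confirm that the steps above took effect. The following is a minimal sketch of such a check using only the Python standard library; check_environment is a hypothetical helper, not part of tensorflow-cloud, and it covers only the Python version (step 2), the gcloud CLI (step 4), and the credentials variable (step 5).

```python
import os
import shutil
import sys


def check_environment(environ=None, python_version=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK."""
    environ = os.environ if environ is None else environ
    if python_version is None:
        python_version = sys.version_info[:2]
    problems = []

    # Step 2: Python >= 3.6.
    if tuple(python_version) < (3, 6):
        problems.append("Python 3.6 or newer is required")

    # Step 4: the gcloud CLI must be on PATH (same idea as `which gcloud`).
    if which("gcloud") is None:
        problems.append("gcloud was not found on PATH")

    # Step 5: GOOGLE_APPLICATION_CREDENTIALS must point at an existing key file.
    key_path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
    elif not os.path.isfile(os.path.expanduser(key_path)):
        problems.append("GOOGLE_APPLICATION_CREDENTIALS does not point to a file")

    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

The environment, Python version, and lookup function are parameters so the check can be exercised without touching the real machine state.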

Usage guide

As described in the high level overview, the run API allows you to train your models at scale on GCP. The run API can be used in four different ways. This is defined by where you are running the API (Terminal vs IPython notebook), and your entry_point parameter. entry_point is an optional Python script or notebook file path to the file that contains your TensorFlow Keras training code. This is the most important parameter in the API.

run(entry_point=None,
    requirements_txt=None,
    distribution_strategy='auto',
    docker_config='auto',
    chief_config='auto',
    worker_config='auto',
    worker_count=0,
    entry_point_args=None,
    stream_logs=False,
    job_labels=None,
    **kwargs)
  1. Using a python file as entry_point.

    If you have your tf.keras model in a python file (mnist_example.py), then you can write the following simple script (scale_mnist.py) to scale your model on GCP.

    import tensorflow_cloud as tfc
    tfc.run(entry_point='mnist_example.py')

    Please note that all the files in the same directory tree as entry_point will be packaged in the docker image created, along with the entry_point file. It's recommended to create a new directory to house each cloud project which includes necessary files and nothing else, to optimize image build times.

  2. Using a notebook file as entry_point.

    If you have your tf.keras model in a notebook file (mnist_example.ipynb), then you can write the following simple script (scale_mnist.py) to scale your model on GCP.

    import tensorflow_cloud as tfc
    tfc.run(entry_point='mnist_example.ipynb')

    Please note that all the files in the same directory tree as entry_point will be packaged in the docker image created, along with the entry_point file. As with the python script entry_point above, we recommend creating a new directory to house each cloud project which includes necessary files and nothing else, to optimize image build times.

  3. Using run within a python script that contains the tf.keras model.

    You can use the run API from within your python file that contains the tf.keras model (mnist_scale.py). In this use case, entry_point should be None. The run API can be called anywhere and the entire file will be executed remotely. The API can be called at the end to run the script locally for debugging purposes (possibly with fewer epochs and other flags).

    import tensorflow_datasets as tfds
    import tensorflow as tf
    import tensorflow_cloud as tfc
    
    tfc.run(
        entry_point=None,
        distribution_strategy='auto',
        requirements_txt='requirements.txt',
        chief_config=tfc.MachineConfig(
                cpu_cores=8,
                memory=30,
                accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_T4,
                accelerator_count=2),
        worker_count=0)
    
    datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
    mnist_train, mnist_test = datasets['train'], datasets['test']
    
    num_train_examples = info.splits['train'].num_examples
    num_test_examples = info.splits['test'].num_examples
    
    BUFFER_SIZE = 10000
    BATCH_SIZE = 64
    
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label
    
    train_dataset = mnist_train.map(scale).cache()
    train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
    
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(
            28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])
    model.fit(train_dataset, epochs=12)

    Please note that all the files in the same directory tree as the python script will be packaged in the docker image created, along with the python file. It's recommended to create a new directory to house each cloud project which includes necessary files and nothing else, to optimize image build times.

  4. Using run within a notebook script that contains the tf.keras model.

    (Image: Colab notebook)

    In this use case, entry_point should be None and docker_config.image_build_bucket must be specified, to ensure the build can be stored and published.

    Cluster and distribution strategy configuration

    By default, run API takes care of wrapping your model code in a TensorFlow distribution strategy based on the cluster configuration you have provided.

    No distribution

    CPU chief config and no additional workers

    tfc.run(entry_point='mnist_example.py',
            chief_config=tfc.COMMON_MACHINE_CONFIGS['CPU'])

    OneDeviceStrategy

    1 GPU on chief (defaults to AcceleratorType.NVIDIA_TESLA_T4) and no additional workers.

    tfc.run(entry_point='mnist_example.py')

    MirroredStrategy

    Chief config with multiple GPUs (AcceleratorType.NVIDIA_TESLA_V100).

    tfc.run(entry_point='mnist_example.py',
            chief_config=tfc.COMMON_MACHINE_CONFIGS['V100_4X'])

    MultiWorkerMirroredStrategy

    Chief config with 1 GPU and 2 workers each with 8 GPUs (AcceleratorType.NVIDIA_TESLA_V100).

    tfc.run(entry_point='mnist_example.py',
            chief_config=tfc.COMMON_MACHINE_CONFIGS['V100_1X'],
            worker_count=2,
            worker_config=tfc.COMMON_MACHINE_CONFIGS['V100_8X'])

    TPUStrategy

    Chief config with 1 CPU and 1 worker with TPU.

    tfc.run(entry_point="mnist_example.py",
            chief_config=tfc.COMMON_MACHINE_CONFIGS["CPU"],
            worker_count=1,
            worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"])

    Please note that TPUStrategy with TensorFlow Cloud works only with TF version 2.1, as this is the latest version supported by AI Platform Cloud TPU.

    Custom distribution strategy

    If you would like to specify the distribution strategy in your model code and do not want the run API to create a strategy, set distribution_strategy to None. This is required, for example, when you are using strategy.experimental_distribute_dataset.

    tfc.run(entry_point='mnist_example.py',
            distribution_strategy=None,
            worker_count=2)
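
The cases listed in this section can be summarized as a small lookup. The following is a minimal sketch that only restates those documented cases; expected_strategy is a hypothetical helper, it is not tensorflow-cloud's own selection code, and its answer for cluster shapes not listed above (for example, CPU-only workers) is an assumption.

```python
def expected_strategy(chief_gpus, worker_count=0, worker_has_tpu=False):
    """Name the strategy that the documented cases pair with a cluster shape.

    Returns None when no distribution strategy applies.
    """
    if worker_has_tpu:
        return "TPUStrategy"
    if worker_count > 0:
        return "MultiWorkerMirroredStrategy"
    if chief_gpus > 1:
        return "MirroredStrategy"
    if chief_gpus == 1:
        return "OneDeviceStrategy"
    return None  # CPU-only chief and no workers: no distribution


# The five documented configurations, in the order they appear above.
print(expected_strategy(chief_gpus=0))                    # CPU chief
print(expected_strategy(chief_gpus=1))                    # default: 1 GPU
print(expected_strategy(chief_gpus=4))                    # V100_4X chief
print(expected_strategy(chief_gpus=1, worker_count=2))    # 2 GPU workers
print(expected_strategy(chief_gpus=0, worker_count=1,
                        worker_has_tpu=True))             # TPU worker
```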

What happens when you call run?

The API call will encompass the following:

  1. Making code entities, such as a Keras script or notebook, cloud and distribution ready.
  2. Converting this distribution entity into a Docker container with the required dependencies.
  3. Deploying this container at scale and training using TensorFlow distribution strategies.
  4. Streaming logs and monitoring them on hosted TensorBoard, and managing checkpoint storage.

By default, we use the local Docker daemon for building and publishing Docker images to Google Container Registry. Images are published to gcr.io/your-gcp-project-id. If you specify docker_config.image_build_bucket, we use Google Cloud Build to build and publish Docker images instead.

We use Google AI Platform for deploying Docker images on GCP.

Please note that, when entry_point argument is specified, all the files in the same directory tree as entry_point will be packaged in the docker image created, along with the entry_point file.

Please see run API documentation for detailed information on the parameters and how you can modify the above processes to suit your needs.

End to end examples

cd src/python/tensorflow_cloud/core
python tests/examples/call_run_on_script_with_keras_fit.py

Running unit tests

pytest src/python/tensorflow_cloud/core/tests/unit/

Local vs remote training

Things to keep in mind when running your jobs remotely:

[Coming soon]

Debugging workflow

Here are some tips for fixing unexpected issues.

Operation disallowed within distribution strategy scope

Error like: Creating a generator within a strategy scope is disallowed, because there is ambiguity on how to replicate a generator (e.g. should it be copied so that each replica gets the same random numbers, or 'split' so that each replica gets different random numbers).

Solution: Passing distribution_strategy='auto' to run API wraps all of your script in a TF distribution strategy based on the cluster configuration provided. You will see the above error or something similar to it, if for some reason an operation is not allowed inside distribution strategy scope. To fix the error, please pass None to the distribution_strategy param and create a strategy instance as part of your training code as shown in this example.
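
As an illustration of the fix, the following is a minimal sketch of training code that creates its own strategy. It assumes TensorFlow 2.x is installed and uses tf.distribute.MirroredStrategy purely as an example choice for a single machine; pick the strategy that matches your cluster. The random generator is created outside the strategy scope, which is the kind of operation the error above complains about when it happens inside a scope.

```python
import tensorflow as tf

# Create the generator before entering any strategy scope.
rng = tf.random.Generator.from_seed(42)

# Create the strategy explicitly instead of relying on
# distribution_strategy='auto'. MirroredStrategy is an illustrative choice;
# it runs with a single replica when no GPU is available.
strategy = tf.distribute.MirroredStrategy()

# Only model construction and compilation go inside the scope, so that the
# model's variables are created under the strategy.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28 * 28,)),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])

print('Replicas in sync:', strategy.num_replicas_in_sync)
```

A script written this way would then be submitted with distribution_strategy=None, so that the run API does not wrap it in a second strategy.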

Docker image build timeout

Error like: requests.exceptions.ConnectionError: ('Connection aborted.', timeout('The write operation timed out'))

Solution: The directory being used as an entry point likely has too much data for the image to successfully build, and there may be extraneous data included in the build. Reformat your directory structure such that the folder which contains the entry point only includes files necessary for the current project.
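
To see how much data sits next to your entry point before building, you can total the files in that directory tree. The following is a minimal sketch using only the standard library; summarize_tree is a hypothetical helper, and the number it reports is only an approximation of what ends up in the image.

```python
import os


def summarize_tree(root):
    """Return (file_count, total_bytes) for every regular file under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks: targets may be broken or counted twice
            file_count += 1
            total_bytes += os.path.getsize(path)
    return file_count, total_bytes


if __name__ == "__main__":
    # Run this from the directory that holds your entry point.
    count, size = summarize_tree(".")
    print(f"{count} files, {size / 1e6:.1f} MB under the current directory")
```

If the total is dominated by datasets, checkpoints, or virtual environments, moving the entry point into a clean directory is the simplest fix.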

Version not supported for TPU training

Error like: There was an error submitting the job.Field: tpu_tf_version Error: The specified runtime version '2.3' is not supported for TPU training. Please specify a different runtime version.

Solution: Please use TF version 2.1. See TPUStrategy in the Cluster and distribution strategy configuration section.

TF nightly build.

Warning like: Docker parent image '2.4.0.dev20200720' does not exist. Using the latest TF nightly build.

Solution: If you do not provide docker_config.parent_image param, then by default we use pre-built TF docker images as parent image. If you do not have TF installed on the environment where run is called, then TF docker image for the latest stable release will be used. Otherwise, the version of the docker image will match the locally installed TF version. However, pre-built TF docker images aren't available for TF nightlies except for the latest. So, if your local TF is an older nightly version, we upgrade to the latest nightly automatically and raise this warning.
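
The selection rules in this paragraph can be written out as a small decision function. The following is a minimal sketch that only restates the description above; choose_parent_image is a hypothetical helper rather than tensorflow-cloud code, the returned labels are descriptive strings and not real Docker image tags, and detecting a nightly build by the "dev" marker in the version string is an assumption based on the version shown in the warning.

```python
def choose_parent_image(parent_image=None, local_tf_version=None,
                        latest_nightly_version=None):
    """Return (choice, warning) following the rules described above."""
    # An explicit docker_config.parent_image always wins.
    if parent_image is not None:
        return parent_image, None

    # No local TF install: fall back to the latest stable release.
    if local_tf_version is None:
        return "latest stable TF image", None

    # Older nightlies have no pre-built image, so upgrade and warn.
    is_nightly = "dev" in local_tf_version
    if is_nightly and local_tf_version != latest_nightly_version:
        warning = (f"Docker parent image '{local_tf_version}' does not exist. "
                   "Using the latest TF nightly build.")
        return "latest TF nightly image", warning

    # Otherwise match the locally installed version.
    return f"TF image for version {local_tf_version}", None


choice, warning = choose_parent_image(
    local_tf_version="2.4.0.dev20200720",
    latest_nightly_version="2.4.0.dev20200901")  # hypothetical latest nightly
print(choice)
print(warning)
```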

Mixing distribution strategy objects.

Error like: RuntimeError: Mixing different tf.distribute.Strategy objects.

Solution: Please provide distribution_strategy=None when you already have a distribution strategy defined in your model code. Specifying distribution_strategy='auto' wraps your code in a TensorFlow distribution strategy, which causes the above error if a strategy object is already used in your code.

Coming up

  • Distributed Keras tuner support.

Contributing

We welcome community contributions. See CONTRIBUTING.md and, for style help, the Writing TensorFlow documentation guide.

License

Apache License 2.0

Privacy Notice

This application reports technical and operational details of your usage of Cloud Services in accordance with the Google privacy policy. For more information, please refer to https://policies.google.com/privacy. If you wish to opt out, you may do so by running tensorflow_cloud.utils.google_api_client.optout_metrics_reporting().
