• Stars: 1,002
• Rank: 41,914 (top 0.9%)
• Language: Java
• License: Apache License 2.0
• Created: almost 6 years ago
• Updated: 4 months ago


Repository Details

Google-provided Cloud Dataflow template pipelines for solving simple in-Cloud data tasks

Google Cloud Dataflow Template Pipelines

These Dataflow templates are an effort to solve simple, but large, in-Cloud data tasks, including data import/export/backup/restore and bulk API operations, without a development environment. The technology under the hood which makes these operations possible is the Google Cloud Dataflow service combined with a set of Apache Beam SDK templated pipelines.

Google is providing this collection of pre-implemented Dataflow templates as a reference and to provide easy customization for developers wanting to extend their functionality.

Open in Cloud Shell

Note on Default Branch

As of November 18, 2021, our default branch is now named "main". This does not affect forks. If you would like your fork and its local clone to reflect these changes you can follow GitHub's branch renaming guide.
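For reference, GitHub's renaming guide boils down to the following commands for updating a local clone; this is a sketch assuming your remote is named origin and your old default branch was master (verify against the official guide before running):

```shell
# Rename the local branch, then re-point it at the renamed remote branch.
git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a
```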

Building

Maven commands should be run on the parent POM. An example would be:

mvn clean package -pl v2/pubsub-binary-to-bigquery -am

Template Pipelines

For documentation on each template's usage and parameters, please see the official docs.

Getting Started

Requirements

  • Java 11
  • Maven 3

Building the Project

Build the entire project using the Maven compile command:

mvn clean compile

Building/Testing from IntelliJ

IntelliJ, by default, will often skip necessary Maven goals, leading to build failures. You can fix these in the Maven view by going to Module_Name > Plugins > Plugin_Name, where Module_Name and Plugin_Name are the names of the module and plugin containing the rule. From there, right-click the rule and select "Execute Before Build".

The list of known rules that require this are:

  • common > Plugins > protobuf > protobuf:compile
  • common > Plugins > protobuf > protobuf:test-compile

Formatting Code

From either the root directory or v2/ directory, run:

mvn spotless:apply

This will format the code and add a license header. To verify that the code is formatted correctly, run:

mvn spotless:check

Executing a Template File

Once the template is staged on Google Cloud Storage, it can then be executed using the gcloud CLI tool. Please check Running classic templates or Using Flex Templates for more information.
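As an illustrative sketch (the {placeholders}, job name, and spec path below are assumptions — they must match your project and where the template was actually staged), launching a staged Flex Template might look like:

```shell
# Launch a staged Flex Template via the gcloud CLI.
# All {placeholders} and the spec path are illustrative.
gcloud dataflow flex-template run "pubsub-to-gcs-$(date +%Y%m%d-%H%M%S)" \
  --project "{projectId}" \
  --region "us-central1" \
  --template-file-gcs-location "gs://{bucketName}/{stagePrefix}/flex/Cloud_PubSub_to_GCS_Text_Flex" \
  --parameters inputTopic="projects/{projectId}/topics/{topicName}",outputDirectory="gs://{bucketName}/out/"
```

Classic templates use `gcloud dataflow jobs run` with `--gcs-location` instead; see the linked documentation for the exact flags each kind supports.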

Developing/Contributing Templates

Templates Plugin

The Templates Plugin was created to make the workflow of creating, testing, and releasing templates easier.

Before using the plugin, please make sure that the gcloud CLI is installed and up-to-date, and that the client is properly authenticated using:

gcloud init
gcloud auth application-default login

After authenticating, install the plugin into your local repository:

mvn clean install -pl plugins/templates-maven-plugin -am

Staging (Deploying) Templates

To stage a template, it is necessary to upload the images to Artifact Registry (for Flex Templates) and copy the template to Cloud Storage.

Although the exact steps depend on the kind of template being developed, the plugin allows a template to be staged using the following single command:

mvn clean package -PtemplatesStage  \
  -DskipTests \
  -DprojectId="{projectId}" \
  -DbucketName="{bucketName}" \
  -DstagePrefix="images/$(date +%Y_%m_%d)_01" \
  -DtemplateName="Cloud_PubSub_to_GCS_Text_Flex" \
  -pl v2/googlecloud-to-googlecloud -am

Notes:

  • Change -pl v2/googlecloud-to-googlecloud and -DtemplateName to point to the specific Maven module where your template is located. Although -pl is not required, it makes the command run considerably faster.
  • If -DtemplateName is not specified, all templates in the module will be staged.

Running a Template

A template can also be executed on Dataflow, directly from the command line. The command-line is similar to staging a template, but it is required to specify -Dparameters with the parameters that will be used when launching the template. For example:

mvn clean package -PtemplatesRun \
  -DskipTests \
  -DprojectId="{projectId}" \
  -DbucketName="{bucketName}" \
  -Dregion="us-central1" \
  -DtemplateName="Cloud_PubSub_to_GCS_Text_Flex" \
  -Dparameters="inputTopic=projects/{projectId}/topics/{topicName},windowDuration=15s,outputDirectory=gs://{outputDirectory}/out,outputFilenamePrefix=output-,outputFilenameSuffix=.txt" \
  -pl v2/googlecloud-to-googlecloud -am

Notes:

  • When running a template, -DtemplateName is mandatory, as -Dparameters= differ across templates.
  • -PtemplatesRun is self-contained, i.e., it is not required to run Staging (Deploying) Templates first. To run a previously staged template, provide its existing path as -DspecPath=gs://.../path.
  • -DjobName="{name}" may be provided if a specific job name is desired (optional).
  • If you encounter the error Template run failed: File too large, try adding -DskipShade to the mvn arguments.

Running Integration Tests

To run integration tests, the developer plugin can also be used to stage templates on demand (when the parameter -DspecPath= is not specified).

For example, to run all the integration tests in a specific module (in the example below, v2/googlecloud-to-googlecloud):

mvn clean verify \
  -PtemplatesIntegrationTests \
  -Dproject="{project}" \
  -DartifactBucket="{bucketName}" \
  -Dregion=us-central1 \
  -pl v2/googlecloud-to-googlecloud -am

The parameter -Dtest= can be given to test a single class (e.g., -Dtest=PubsubToTextIT) or single test case (e.g., -Dtest=PubsubToTextIT#testTopicToGcs).
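For example, combining the flags above to run a single test case from the command line (the braced values are placeholders for your project and bucket):

```shell
mvn clean verify \
  -PtemplatesIntegrationTests \
  -Dtest=PubsubToTextIT#testTopicToGcs \
  -Dproject="{project}" \
  -DartifactBucket="{bucketName}" \
  -Dregion=us-central1 \
  -pl v2/googlecloud-to-googlecloud -am
```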

The same applies when tests are executed from an IDE; just make sure to add the parameters -Dproject=, -DartifactBucket=, and -Dregion= as program or VM arguments.

Metadata Annotations

A template requires more information than just a name and description. For example, in order to be used from the Dataflow UI, parameters need a longer help text to guide users, as well as proper types and validations to make sure parameters are being passed correctly.

We introduced annotations to have the source code as a single source of truth, along with a set of utilities / plugins to generate template-accompanying artifacts (such as command specs, parameter specs).

@Template Annotation

Every template must be annotated with @Template. Existing templates can be used for reference, but the structure is as follows:

@Template(
    name = "BigQuery_to_Elasticsearch",
    category = TemplateCategory.BATCH,
    displayName = "BigQuery to Elasticsearch",
    description = "A pipeline which sends BigQuery records into an Elasticsearch instance as JSON documents.",
    optionsClass = BigQueryToElasticsearchOptions.class,
    flexContainerName = "bigquery-to-elasticsearch")
public class BigQueryToElasticsearch {

@TemplateParameter Annotation

A set of @TemplateParameter.{Type} annotations was created to allow the definition of options for a template, their proper rendering in the UI, and validation by the template launch service. Examples can be found in the repository, but the general structure is as follows:

@TemplateParameter.Text(
    order = 2,
    optional = false,
    regexes = {"[,a-zA-Z0-9._-]+"},
    description = "Kafka topic(s) to read the input from",
    helpText = "Kafka topic(s) to read the input from.",
    example = "topic1,topic2")
@Validation.Required
String getInputTopics();

@TemplateParameter.GcsReadFile(
    order = 1,
    description = "Cloud Storage Input File(s)",
    helpText = "Path of the file pattern glob to read from.",
    example = "gs://your-bucket/path/*.csv")
String getInputFilePattern();

@TemplateParameter.Boolean(
    order = 11,
    optional = true,
    description = "Whether to use column alias to map the rows.",
    helpText = "If enabled (set to true) the pipeline will consider column alias (\"AS\") instead of the column name to map the rows to BigQuery.")
@Default.Boolean(false)
Boolean getUseColumnAlias();

@TemplateParameter.Enum(
    order = 21,
    enumOptions = {"INDEX", "CREATE"},
    optional = true,
    description = "Build insert method",
    helpText = "Whether to use INDEX (index, allows upsert) or CREATE (create, errors on duplicate _id) with Elasticsearch bulk requests.")
@Default.Enum("CREATE")
BulkInsertMethodOptions getBulkInsertMethod();

Note: order is relevant for templates that can be used from the UI, and specifies the relative order of parameters.

@TemplateIntegrationTest Annotation

This annotation should be added to classes that implement integration tests for templates. It wires a specific IT class to a template, and allows environment preparation / proper template staging before tests are executed on Dataflow.

Template tests have to follow this general format (please note the @TemplateIntegrationTest annotation and the TemplateTestBase super-class):

@TemplateIntegrationTest(PubsubToText.class)
@RunWith(JUnit4.class)
public final class PubsubToTextIT extends TemplateTestBase {

Please refer to Templates Plugin to use and validate such annotations.

Using UDFs

User-defined functions (UDFs) allow you to customize a template's functionality by providing a short JavaScript function, without having to maintain the entire codebase. This is useful in situations in which you'd like to rename fields, filter values, or even transform data formats before output to the destination. All UDFs are executed by providing the payload of the element as a string to the JavaScript function. You can then use JavaScript's built-in JSON parser or other system functions to transform the data prior to the pipeline's output. The return statement of a UDF specifies the payload to pass forward in the pipeline, and should always return a string value. If no value is returned or the function returns undefined, the incoming record will be filtered from the output.

UDF Function Specification

| Template | UDF Input Type | Input Description | UDF Output Type | Output Description |
| --- | --- | --- | --- | --- |
| Datastore Bulk Delete | String | A JSON string of the entity | String | A JSON string of the entity to delete; filter entities by returning undefined |
| Datastore to Pub/Sub | String | A JSON string of the entity | String | The payload to publish to Pub/Sub |
| Datastore to GCS Text | String | A JSON string of the entity | String | A single line within the output file |
| GCS Text to BigQuery | String | A single line within the input file | String | A JSON string which matches the destination table's schema |
| Pub/Sub to BigQuery | String | A string representation of the incoming payload | String | A JSON string which matches the destination table's schema |
| Pub/Sub to Datastore | String | A string representation of the incoming payload | String | A JSON string of the entity to write to Datastore |
| Pub/Sub to Splunk | String | A string representation of the incoming payload | String | The event data to be sent to the Splunk HEC events endpoint. Must be a string or a stringified JSON object |

UDF Examples

For a comprehensive list of samples, please check our udf-samples folder.

Adding fields

/**
 * A transform which adds a field to the incoming data.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(inJson) {
  var obj = JSON.parse(inJson);
  obj.dataFeed = "Real-time Transactions";
  obj.dataSource = "POS";
  return JSON.stringify(obj);
}

Filtering records

/**
 * A transform function which only accepts 42 as the answer to life.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(inJson) {
  var obj = JSON.parse(inJson);
  // only output objects which have an answer to life of 42.
  if (obj.hasOwnProperty('answerToLife') && obj.answerToLife === 42) {
    return JSON.stringify(obj);
  }
}
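
Renaming fields

As a further sketch in the same style (the field names msg and message here are purely illustrative, not taken from any template):

```javascript
/**
 * A transform which renames the incoming `msg` field to `message`.
 * The field names are illustrative only.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(inJson) {
  var obj = JSON.parse(inJson);
  if (obj.hasOwnProperty('msg')) {
    obj.message = obj.msg;
    delete obj.msg;
  }
  return JSON.stringify(obj);
}
```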

Generated Documentation

This repository contains generated documentation, which contains a list of parameters and instructions on how to customize and/or build every template.

To generate the documentation for all templates, the following command can be used:

mvn clean prepare-package \
  -DskipTests \
  -PtemplatesSpec

Release Process

Templates are released on a weekly basis (best effort) as part of the efforts to keep Google-provided templates updated with the latest fixes and improvements.

If desired, you can stage and use your own changes by following the Staging (Deploying) Templates steps.

To release multiple templates at once, we provide a single Maven command, which is a shortcut to stage all templates while running additional validations:

mvn clean verify -PtemplatesRelease \
  -DprojectId="{projectId}" \
  -DbucketName="{bucketName}" \
  -DlibrariesBucketName="{bucketName}-libraries" \
  -DstagePrefix="$(date +%Y_%m_%d)-00_RC00"

More Information

  • Dataflow Templates - basic template concepts.
  • Google-provided Templates - official documentation for templates provided by Google (the source code is in this repository).
  • Dataflow Cookbook: Blog, GitHub Repository - pipeline examples and practical solutions to common data processing challenges.
  • Dataflow Metrics Collector - CLI tool to collect Dataflow resource and execution metrics and export them to either BigQuery or Google Cloud Storage. Useful for comparing and visualizing metrics while benchmarking Dataflow pipelines using various data formats, resource configurations, etc.
  • Apache Beam
    • Overview
    • Quickstart: Java, Python, Go
    • Tour of Beam - an interactive tour with learning topics covering core Beam concepts from simple ones to more advanced ones.
    • Beam Playground - an interactive environment to try out Beam transforms and examples without having to install Apache Beam.
    • Beam College - hands-on training and practical tips, including video recordings of Apache Beam and Dataflow Templates lessons.
    • Getting Started with Apache Beam - Quest - a five-lab series that provides a Google Cloud certified badge upon completion.
