dsub: simple batch jobs with Docker

Overview

dsub is a command-line tool that makes it easy to submit and run batch scripts in the cloud.

The dsub user experience is modeled after traditional high-performance computing job schedulers like Grid Engine and Slurm. You write a script and then submit it to a job scheduler from a shell prompt on your local machine.

Today dsub supports Google Cloud as the backend batch job runner, along with a local provider for development and testing. With help from the community, we'd like to add other backends, such as Grid Engine, Slurm, Amazon Batch, and Azure Batch.

Getting started

dsub is written in Python and requires Python 3.7 or higher.

  • The last version to support Python 3.6 was dsub 0.4.7.
  • For earlier versions of Python 3, use dsub 0.4.1.
  • For Python 2, use dsub 0.3.10.
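
If you need one of those older releases, you can pin the version when installing; for example (a sketch, assuming pip is available for your Python interpreter):

    # Check which Python you have, then pin dsub accordingly
    python3 --version

    # Example: install the last dsub release that supports Python 3.6
    python3 -m pip install "dsub==0.4.7"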

Pre-installation steps

Create a Python virtual environment

This is optional, but whether installing from PyPI or from GitHub, you are strongly encouraged to use a Python virtual environment.

You can do this in a directory of your choosing.

    python3 -m venv dsub_libs
    source dsub_libs/bin/activate

Using a Python virtual environment isolates dsub library dependencies from other Python applications on your system.

Activate this virtual environment in any shell session before running dsub. To deactivate the virtual environment in your shell, run the command:

    deactivate

Alternatively, a set of convenience scripts is provided that activate the virtualenv before calling dsub, dstat, and ddel. They are in the bin directory. You can use these scripts if you don't want to activate the virtualenv explicitly in your shell.
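
For example, a minimal sketch (assuming you cloned the repository to ~/dsub and created the dsub_libs virtual environment as shown above):

    # The wrapper script activates the virtualenv and then runs dsub
    ~/dsub/bin/dsub --help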

Install the Google Cloud SDK

While the google-v2 and google-cls-v2 providers do not use it directly, you are likely to want the command-line tools included in the Google Cloud SDK.

If you will be using the local provider for faster job development, you will need to install the Google Cloud SDK; the local provider uses its gsutil command to ensure file operation semantics consistent with the Google dsub providers.

  1. Install the Google Cloud SDK

  2. Run

     gcloud init
    

    gcloud will prompt you to set your default project and to grant credentials to the Google Cloud SDK.

Install dsub

Choose one of the following:

Install from PyPI

  1. If necessary, install pip.

  2. Install dsub

     pip install dsub
    

Install from GitHub

  1. Be sure you have git installed

    Instructions for your environment can be found on the git website.

  2. Clone this repository.

    git clone https://github.com/DataBiosphere/dsub
    cd dsub
    
  3. Install dsub (this will also install the dependencies)

    python -m pip install .
    
  4. Set up Bash tab completion (optional).

    source bash_tab_complete
    

Post-installation steps

  1. Minimally verify the installation by running:

    dsub --help
    
  2. (Optional) Install Docker.

    This is necessary only if you're going to create your own Docker images or use the local provider.

Makefile

After cloning the dsub repo, you can also use the Makefile by running:

    make

This will create a Python virtual environment and install dsub into a directory named dsub_libs.
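
After make completes, activate that virtual environment before running dsub; a minimal sketch, assuming the default dsub_libs directory:

    source dsub_libs/bin/activate
    dsub --help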

Getting started with the local provider

We think you'll find the local provider to be very helpful when building your dsub tasks. Instead of submitting a request to run your command on a cloud VM, the local provider runs your dsub tasks on your local machine.

The local provider is not designed for running at scale. It is designed to emulate running on a cloud VM such that you can rapidly iterate. You'll get quicker turnaround times and won't incur cloud charges using it.

  1. Run a dsub job and wait for completion.

    Here is a very simple "Hello World" test:

     dsub \
       --provider local \
       --logging "${TMPDIR:-/tmp}/dsub-test/logging/" \
       --output OUT="${TMPDIR:-/tmp}/dsub-test/output/out.txt" \
       --command 'echo "Hello World" > "${OUT}"' \
       --wait
    

    Note: TMPDIR is commonly set to /tmp by default on most Unix systems, although it is also often left unset. On some versions of macOS, TMPDIR is set to a location under /var/folders.

    Note: The above syntax ${TMPDIR:-/tmp} is known to be supported by Bash, zsh, and ksh. The shell will expand TMPDIR if it is set; if it is unset, /tmp will be used.

  2. View the output file.

     cat "${TMPDIR:-/tmp}/dsub-test/output/out.txt"
    

Getting started on Google Cloud

dsub supports the use of two different APIs from Google Cloud for running tasks. Google Cloud is transitioning from Genomics v2alpha1 to Cloud Life Sciences v2beta.

dsub supports both APIs with the (old) google-v2 and (new) google-cls-v2 providers respectively. google-v2 is the current default provider. dsub will be transitioning to make google-cls-v2 the default in coming releases.

The steps for getting started differ slightly as indicated in the steps below:

  1. Sign up for a Google account and create a project.

  2. Enable the APIs:

    • For the v2alpha1 API (provider: google-v2):

    Enable the Genomics, Storage, and Compute APIs.

    • For the v2beta API (provider: google-cls-v2):

    Enable the Cloud Life Sciences, Storage, and Compute APIs.

  3. Provide credentials so dsub can call Google APIs:

     gcloud auth application-default login
    
  4. Create a Google Cloud Storage bucket.

    The dsub logs and output files will be written to a bucket. Create a bucket using the storage browser or run the command-line utility gsutil, included in the Cloud SDK.

    gsutil mb gs://my-bucket
    

    Change my-bucket to a unique name that follows the bucket-naming conventions.

    (By default, the bucket will be in the US, but you can change or refine the location setting with the -l option.)

  5. Run a very simple "Hello World" dsub job and wait for completion.

    • For the v2alpha1 API (provider: google-v2):

        dsub \
          --provider google-v2 \
          --project my-cloud-project \
          --regions us-central1 \
          --logging gs://my-bucket/logging/ \
          --output OUT=gs://my-bucket/output/out.txt \
          --command 'echo "Hello World" > "${OUT}"' \
          --wait
      

    Change my-cloud-project to your Google Cloud project, and my-bucket to the bucket you created above.

    • For the v2beta API (provider: google-cls-v2):

        dsub \
          --provider google-cls-v2 \
          --project my-cloud-project \
          --regions us-central1 \
          --logging gs://my-bucket/logging/ \
          --output OUT=gs://my-bucket/output/out.txt \
          --command 'echo "Hello World" > "${OUT}"' \
          --wait
      

    Change my-cloud-project to your Google Cloud project, and my-bucket to the bucket you created above.

    The output of the script command will be written to the OUT file in Cloud Storage that you specify.

  6. View the output file.

     gsutil cat gs://my-bucket/output/out.txt
    

Backend providers

Where possible, dsub tries to support users being able to develop and test locally (for faster iteration) and then progressing to running at scale.

To this end, dsub provides multiple "backend providers", each of which implements a consistent runtime environment. The current providers are:

  • local
  • google-v2 (the default)
  • google-cls-v2 (new)

More details on the runtime environment implemented by the backend providers can be found in dsub backend providers.

Differences between google-v2 and google-cls-v2

The google-cls-v2 provider is built on the Cloud Life Sciences v2beta API. This API is very similar to its predecessor, the Genomics v2alpha1 API. Details of the differences can be found in the Migration Guide.

dsub largely hides the differences between the two APIs, but there are a few differences to note:

  • v2beta is a regional service, while v2alpha1 is a global service

What this means is that with v2alpha1, the metadata about your tasks (called "operations") is stored in a global database, while with v2beta, the metadata about your tasks is stored in a regional database. If your operation information needs to stay in a particular region, use the v2beta API (the google-cls-v2 provider) and specify the --location where your operation information should be stored.

  • The --regions and --zones flags can be omitted when using google-cls-v2

The --regions and --zones flags for dsub specify where the tasks should run. More specifically, this specifies what Compute Engine Zones to use for the VMs that run your tasks.

With the google-v2 provider, there is no default region or zone, and thus one of the --regions or --zones flags is required.

With google-cls-v2, the --location flag defaults to us-central1, and if the --regions and --zones flags are omitted, the location will be used as the default regions list.
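
For example, to keep your operation metadata (and, by default, your VMs) in a specific region with the google-cls-v2 provider, you can pass --location explicitly; a sketch, using hypothetical project and bucket names and europe-west2 as an example location:

dsub \
    --provider google-cls-v2 \
    --project my-cloud-project \
    --location europe-west2 \
    --regions europe-west2 \
    --logging gs://my-bucket/logging/ \
    --command 'echo "Hello from europe-west2"' \
    --wait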

dsub features

The following sections show how to run more complex jobs.

Defining what code to run

You can provide a shell command directly in the dsub command-line, as in the hello example above.

You can also save your script to a file, like hello.sh. Then you can run:

dsub \
    ... \
    --script hello.sh

If your script has dependencies that are not stored in your Docker image, you can transfer them to the local disk. See the instructions below for working with input and output files and folders.

Selecting a Docker image

To get started more easily, dsub uses a stock Ubuntu Docker image. This default image may change at any time in future releases, so for reproducible production workflows, you should always specify the image explicitly.

You can change the image by passing the --image flag.

dsub \
    ... \
    --image ubuntu:16.04 \
    --script hello.sh

Note: your --image must include the Bash shell interpreter.

For more information on using the --image flag, see the image section in Scripts, Commands, and Docker

Passing parameters to your script

You can pass environment variables to your script using the --env flag.

dsub \
    ... \
    --env MESSAGE=hello \
    --command 'echo ${MESSAGE}'

The environment variable MESSAGE will be assigned the value hello when your Docker container runs.

Your script or command can reference the variable like any other Linux environment variable, as ${MESSAGE}.

Be sure to enclose your command string in single quotes and not double quotes. If you use double quotes, the command will be expanded in your local shell before being passed to dsub. For more information on using the --command flag, see Scripts, Commands, and Docker

To set multiple environment variables, you can repeat the flag:

--env VAR1=value1 \
--env VAR2=value2

You can also set multiple variables, space-delimited, with a single flag:

--env VAR1=value1 VAR2=value2

Working with input and output files and folders

dsub mimics the behavior of a shared file system using Cloud Storage bucket paths for input and output files and folders. You specify the Cloud Storage bucket path. Paths can be:

  • file paths like gs://my-bucket/my-file
  • folder paths like gs://my-bucket/my-folder
  • wildcard paths like gs://my-bucket/my-folder/*

See the inputs and outputs documentation for more details.
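
For example, a wildcard input localizes all matching files to the data disk, and the input environment variable then refers to the localized files; it is left unquoted below so the shell can expand the wildcard pattern (a sketch with hypothetical bucket paths):

dsub \
    ... \
    --input INPUT_FILES=gs://my-bucket/my-folder/*.txt \
    --output OUT=gs://my-bucket/output/line-counts.txt \
    --command 'wc -l ${INPUT_FILES} > "${OUT}"'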

Transferring input files to a Google Cloud Storage bucket.

If your script expects to read local input files that are not already contained within your Docker image, the files must be available in Google Cloud Storage.

If your script has dependent files, you can make them available to your script by:

  • Building a private Docker image with the dependent files and publishing the image to a public site, or privately to Google Container Registry
  • Uploading the files to Google Cloud Storage

To upload the files to Google Cloud Storage, you can use the storage browser or gsutil. You can also run on data that’s public or shared with your service account, an email address that you can find in the Google Cloud Console.

Files

To specify input and output files, use the --input and --output flags:

dsub \
    ... \
    --input INPUT_FILE_1=gs://my-bucket/my-input-file-1 \
    --input INPUT_FILE_2=gs://my-bucket/my-input-file-2 \
    --output OUTPUT_FILE=gs://my-bucket/my-output-file \
    --command 'cat "${INPUT_FILE_1}" "${INPUT_FILE_2}" > "${OUTPUT_FILE}"'

In this example:

  • a file will be copied from gs://my-bucket/my-input-file-1 to a path on the data disk
  • the path to the file on the data disk will be set in the environment variable ${INPUT_FILE_1}
  • a file will be copied from gs://my-bucket/my-input-file-2 to a path on the data disk
  • the path to the file on the data disk will be set in the environment variable ${INPUT_FILE_2}

The --command can reference the file paths using the environment variables.

Also in this example:

  • a path on the data disk will be set in the environment variable ${OUTPUT_FILE}
  • the output file will be written to the data disk at the location given by ${OUTPUT_FILE}

After the --command completes, the output file will be copied to the bucket path gs://my-bucket/my-output-file.

Multiple --input and --output parameters can be specified, and they can be specified in any order.

Folders

To copy folders rather than files, use the --input-recursive and --output-recursive flags:

dsub \
    ... \
    --input-recursive FOLDER=gs://my-bucket/my-folder \
    --command 'find ${FOLDER} -name "foo*"'

Multiple --input-recursive and --output-recursive parameters can be specified, and they can be specified in any order.
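
An --output-recursive folder works the same way in reverse: everything written under the folder on the data disk is copied to the bucket after the command completes. A sketch with hypothetical paths:

dsub \
    ... \
    --output-recursive OUTPUT_FOLDER=gs://my-bucket/my-output-folder \
    --command 'mkdir -p "${OUTPUT_FOLDER}/results" && date > "${OUTPUT_FOLDER}/results/timestamp.txt"'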

Mounting "resource data"

While explicitly specifying inputs improves tracking the provenance of your data, there are cases where you might not want to explicitly localize all inputs from Cloud Storage to your job VM.

For example, if you have:

  • a large set of resource files
  • your code only reads a subset of those files
  • runtime decisions of which files to read

OR

  • a large input file over which your code makes a single read pass

OR

  • a large input file that your code does not read in its entirety

then you may find it more efficient or convenient to access this data by mounting read-only:

  • a Google Cloud Storage bucket
  • a persistent disk that you pre-create and populate
  • a persistent disk that gets created from a Compute Engine Image that you pre-create.

The google-v2 and google-cls-v2 providers support these methods of providing access to resource data.

The local provider supports mounting a local directory in a similar fashion to support your local development.

Mounting a Google Cloud Storage bucket

To have the google-v2 or google-cls-v2 provider mount a Cloud Storage bucket using Cloud Storage FUSE, use the --mount command line flag:

--mount RESOURCES=gs://mybucket

The bucket will be mounted into the Docker container running your --script or --command and the location made available via the environment variable ${RESOURCES}. Inside your script, you can reference the mounted path using the environment variable. Please read Key differences from a POSIX file system and Semantics before using Cloud Storage FUSE.
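
For example, a task might read reference data directly from the mounted bucket path (a sketch; the bucket name and file path are hypothetical):

dsub \
    ... \
    --mount RESOURCES=gs://my-reference-bucket \
    --command 'head -n 10 "${RESOURCES}/reference/genome.fa"'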

Mounting an existing persistent disk

To have the google-v2 or google-cls-v2 provider mount a persistent disk that you have pre-created and populated, use the --mount command line flag and the URL of the source disk:

--mount RESOURCES="https://www.googleapis.com/compute/v1/projects/your-project/zones/your_disk_zone/disks/your-disk"
Mounting a persistent disk, created from an image

To have the google-v2 or google-cls-v2 provider mount a persistent disk created from an image, use the --mount command line flag and the URL of the source image and the size (in GB) of the disk:

--mount RESOURCES="https://www.googleapis.com/compute/v1/projects/your-project/global/images/your-image 50"

The image will be used to create a new persistent disk, which will be attached to a Compute Engine VM. The disk will be mounted into the Docker container running your --script or --command and the location made available by the environment variable ${RESOURCES}. Inside your script, you can reference the mounted path using the environment variable.

To create an image, see Creating a custom image.

Mounting a local directory (local provider)

To have the local provider mount a directory read-only, use the --mount command line flag and a file:// prefix:

--mount RESOURCES=file://path/to/my/dir

The local directory will be mounted into the Docker container running your --script or --command and the location made available via the environment variable ${RESOURCES}. Inside your script, you can reference the mounted path using the environment variable.

Setting resource requirements

dsub tasks run using the local provider will use the resources available on your local machine.

dsub tasks run using the google-v2 or google-cls-v2 providers can take advantage of a wide range of CPU, RAM, disk, and hardware accelerator (e.g. GPU) options.

See the Compute Resources documentation for details.
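
As a sketch (assuming the --min-cores, --min-ram, and --disk-size flags described in the Compute Resources documentation; the values here are arbitrary), a task that needs more CPU, memory, and disk might look like:

dsub \
    ... \
    --min-cores 4 \
    --min-ram 16 \
    --disk-size 200 \
    --command 'echo "resource-intensive work goes here"'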

Job Identifiers

By default, dsub generates a job-id with the form job-name--userid--timestamp where the job-name is truncated at 10 characters and the timestamp is of the form YYMMDD-HHMMSS-XX, unique to hundredths of a second. If you are submitting multiple jobs concurrently, you may still run into situations where the job-id is not unique. If you require a unique job-id for this situation, you may use the --unique-job-id parameter.

If the --unique-job-id parameter is set, job-id will instead be a unique 32-character UUID created by the Python uuid module (https://docs.python.org/3/library/uuid.html). Because some providers require that the job-id begin with a letter, dsub will replace any starting digit with a letter in a manner that preserves uniqueness.
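
For example:

# Request a UUID-based job-id so that concurrent submissions cannot collide
dsub \
    ... \
    --unique-job-id \
    --command 'echo "Hello World"'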

Submitting a batch job

Each of the examples above has demonstrated submitting a single task with a single set of variables, inputs, and outputs. If you have a batch of inputs and you want to run the same operation over them, dsub allows you to create a batch job.

Instead of calling dsub repeatedly, you can create a tab-separated values (TSV) file containing the variables, inputs, and outputs for each task, and then call dsub once. The result will be a single job-id with multiple tasks. The tasks will be scheduled and run independently, but can be monitored and deleted as a group.

Tasks file format

The first line of the TSV file specifies the names and types of the parameters. For example:

--env SAMPLE_ID<tab>--input VCF_FILE<tab>--output OUTPUT_PATH

Each additional line in the file provides the variable, input, and output values for a single task; that is, each line beyond the header represents a separate task.

Multiple --env, --input, and --output parameters can be specified and they can be specified in any order. For example:

--env SAMPLE<tab>--input A<tab>--input B<tab>--env REFNAME<tab>--output O
S1<tab>gs://path/A1.txt<tab>gs://path/B1.txt<tab>R1<tab>gs://path/O1.txt
S2<tab>gs://path/A2.txt<tab>gs://path/B2.txt<tab>R2<tab>gs://path/O2.txt
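
One way to create such a file with literal tab characters is with printf; a sketch, using hypothetical bucket paths:

# Write the header row plus two task rows; \t produces the tab separators
printf -- '--env SAMPLE_ID\t--input VCF_FILE\t--output OUTPUT_PATH\n' > my-tasks.tsv
printf -- 'S1\tgs://my-bucket/inputs/S1.vcf\tgs://my-bucket/outputs/S1.txt\n' >> my-tasks.tsv
printf -- 'S2\tgs://my-bucket/inputs/S2.vcf\tgs://my-bucket/outputs/S2.txt\n' >> my-tasks.tsv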

Tasks parameter

Pass the TSV file to dsub using the --tasks parameter. This parameter accepts both the file path and optionally a range of tasks to process. The file may be read from the local filesystem (on the machine you're calling dsub from), or from a bucket in Google Cloud Storage (file name starts with "gs://").

For example, suppose my-tasks.tsv contains 101 lines: a one-line header and 100 lines of parameters for tasks to run. Then:

dsub ... --tasks ./my-tasks.tsv

will create a job with 100 tasks, while:

dsub ... --tasks ./my-tasks.tsv 1-10

will create a job with 10 tasks, one for each of lines 2 through 11.

The task range values can take any of the following forms:

  • m indicates to submit task m (line m+1)
  • m- indicates to submit all tasks starting with task m
  • m-n indicates to submit all tasks from m to n (inclusive).
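
For example, to submit only task 11 onward (lines 12 through the end of the file):

dsub ... --tasks ./my-tasks.tsv 11-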

Logging

The --logging flag points to a location for dsub task log files. For details on how to specify your logging path, see Logging.

Job control

It's possible to wait for a job to complete before starting another. For details, see job control with dsub.

Retries

It is possible for dsub to automatically retry failed tasks. For details, see retries with dsub.

Labeling jobs and tasks

You can add custom labels to jobs and tasks, which allows you to monitor and cancel tasks using your own identifiers. In addition, with the Google providers, labeling a task will label associated compute resources such as virtual machines and disks.
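
For example, a sketch assuming dsub's repeatable --label KEY=VALUE flag (the label names and values here are hypothetical):

dsub \
    ... \
    --label sample-set=pilot \
    --label owner=my-team \
    --command 'echo "labeled task"'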

For more details, see Checking Status and Troubleshooting Jobs

Viewing job status

The dstat command displays the status of jobs:

dstat --provider google-v2 --project my-cloud-project

With no additional arguments, dstat will display a list of running jobs for the current USER.

To display the status of a specific job, use the --jobs flag:

dstat --provider google-v2 --project my-cloud-project --jobs job-id

For a batch job, the output will list all running tasks.

Each job submitted by dsub is given a set of metadata values that can be used for job identification and job control. The metadata associated with each job includes:

  • job-name: defaults to the name of your script file or the first word of your script command; it can be explicitly set with the --name parameter.
  • user-id: the USER environment variable value.
  • job-id: identifier of the job, which can be used in calls to dstat and ddel for job monitoring and canceling respectively. See Job Identifiers for more details on the job-id format.
  • task-id: if the job is submitted with the --tasks parameter, each task gets a sequential value of the form "task-n" where n is 1-based.

Note that the job metadata values will be modified to conform with the "Label Restrictions" listed in the Checking Status and Troubleshooting Jobs guide.

Metadata can be used to cancel a job or individual tasks within a batch job.

For more details, see Checking Status and Troubleshooting Jobs

Summarizing job status

By default, dstat outputs one line per task. If you're using a batch job with many tasks then you may benefit from --summary.

$ dstat --provider google-v2 --project my-project --status '*' --summary

Job Name        Status         Task Count
-------------   -------------  -------------
my-job-name     RUNNING        2
my-job-name     SUCCESS        1

In this mode, dstat prints one line per (job name, task status) pair. You can see at a glance how many tasks are finished, how many are still running, and how many have failed or been canceled.

Deleting a job

The ddel command will delete running jobs.

By default, only jobs submitted by the current user will be deleted. Use the --users flag to specify other users, or '*' for all users.

To delete a running job:

ddel --provider google-v2 --project my-cloud-project --jobs job-id

If the job is a batch job, all running tasks will be deleted.

To delete specific tasks:

ddel \
    --provider google-v2 \
    --project my-cloud-project \
    --jobs job-id \
    --tasks task-id1 task-id2

To delete all running jobs for the current user:

ddel --provider google-v2 --project my-cloud-project --jobs '*'

Service Accounts and Scope (Google providers only)

When you run the dsub command with the google-v2 or google-cls-v2 provider, there are two different sets of credentials to consider:

  • Account submitting the pipelines.run() request to run your command/script on a VM
  • Account accessing Cloud resources (such as files in GCS) when executing your command/script

The account used to submit the pipelines.run() request is typically your end user credentials. You would have set this up by running:

gcloud auth application-default login

The account used on the VM is a service account, as illustrated in the repository's "Pipelines Runner Architecture" diagram.

By default, dsub will use the default Compute Engine service account as the authorized service account on the VM instance. You can choose to specify the email address of another service account using --service-account.

By default, dsub will grant the following access scopes to the service account:

In addition, the API will always add this scope:

You can choose to specify scopes using --scopes.

Recommendations for service accounts

While it is straightforward to use the default service account, this account also has broad privileges granted to it by default. Following the Principle of Least Privilege, you may want to create and use a service account that has only sufficient privileges granted to run your dsub command/script.

To create a new service account, follow the steps below:

  1. Execute the gcloud iam service-accounts create command. The email address of the service account will be sa-name@my-cloud-project.iam.gserviceaccount.com.

     gcloud iam service-accounts create "sa-name"
    
  2. Grant IAM access on buckets, etc. to the service account.

     gsutil iam ch serviceAccount:sa-name@my-cloud-project.iam.gserviceaccount.com:roles/storage.objectAdmin gs://bucket-name
    
  3. Update your dsub command to include --service-account

     dsub \
       --service-account sa-name@my-cloud-project.iam.gserviceaccount.com \
       ...
    

What next?
