• Stars
    star
    185
  • Rank 208,271 (Top 5 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 2 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Databricks SDK for Python (Beta)

Databricks SDK for Python (Beta)

PyPI - Downloads PyPI - License databricks-sdk PyPI codecov

Beta: This SDK is supported for production use cases, but we do expect future releases to have some interface changes; see Interface stability. We are keen to hear feedback from you on these SDKs. Please file issues, and we will address them. | See also the SDK for Java | See also the SDK for Go | See also the Terraform Provider | See also cloud-specific docs (AWS, Azure, GCP)

The Databricks SDK for Python includes functionality to accelerate development with Python for the Databricks Lakehouse. It covers all public Databricks REST API operations. The SDK's internal HTTP client is robust and handles failures on different levels by performing intelligent retries.

Contents

Getting started

  1. Please install Databricks SDK for Python via pip install databricks-sdk and instantiate WorkspaceClient:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
for c in w.clusters.list():
    print(c.cluster_name)

Databricks SDK for Python is compatible with Python 3.7 (until June 2023), 3.8, 3.9, 3.10, and 3.11.

Code examples

Please checkout custom credentials provider, OAuth with Flask, Last job runs, Starting job and waiting examples. You can also dig deeper into different services, like alerts, billable_usage, catalogs, cluster_policies, clusters, credentials, current_user, dashboards, data_sources, databricks, encryption_keys, experiments, external_locations, git_credentials, global_init_scripts, groups, instance_pools, instance_profiles, ip_access_lists, jobs, libraries, local_browser_oauth.py, log_delivery, metastores, model_registry, networks, permissions, pipelines, private_access, queries, recipients, repos, schemas, secrets, service_principals, storage, storage_credentials, tokens, users, vpc_endpoints, warehouses, workspace, workspace_assignment, workspace_conf, and workspaces.

Authentication

If you use Databricks configuration profiles or Databricks-specific environment variables for Databricks authentication, the only code required to start working with a Databricks workspace is the following code snippet, which instructs the Databricks SDK for Python to use its default authentication flow:

from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
w. # press <TAB> for autocompletion

The conventional name for the variable that holds the workspace-level client of the Databricks SDK for Python is w, which is shorthand for workspace.

In this section

Default authentication flow

If you run the Databricks Terraform Provider, the Databricks SDK for Go, the Databricks CLI, or applications that target the Databricks SDKs for other languages, most likely they will all interoperate nicely together. By default, the Databricks SDK for Python tries the following authentication methods, in the following order, until it succeeds:

  1. Databricks native authentication
  2. Azure native authentication
  3. If the SDK is unsuccessful at this point, it returns an authentication error and stops running.

You can instruct the Databricks SDK for Python to use a specific authentication method by setting the auth_type argument as described in the following sections.

For each authentication method, the SDK searches for compatible authentication credentials in the following locations, in the following order. Once the SDK finds a compatible set of credentials that it can use, it stops searching:

  1. Credentials that are hard-coded into configuration arguments.

    ⚠️ Caution: Databricks does not recommend hard-coding credentials into arguments, as they can be exposed in plain text in version control systems. Use environment variables or configuration profiles instead.

  2. Credentials in Databricks-specific environment variables.

  3. For Databricks native authentication, credentials in the .databrickscfg file's DEFAULT configuration profile from its default file location (~ for Linux or macOS, and %USERPROFILE% for Windows).

  4. For Azure native authentication, the SDK searches for credentials through the Azure CLI as needed.

Depending on the Databricks authentication method, the SDK uses the following information. Presented are the WorkspaceClient and AccountClient arguments (which have corresponding .databrickscfg file fields), their descriptions, and any corresponding environment variables.

Databricks native authentication

By default, the Databricks SDK for Python initially tries Databricks token authentication (auth_type='pat' argument). If the SDK is unsuccessful, it then tries Databricks basic (username/password) authentication (auth_type="basic" argument).

  • For Databricks token authentication, you must provide host and token; or their environment variable or .databrickscfg file field equivalents.
  • For Databricks basic authentication, you must provide host, username, and password (for AWS workspace-level operations); or host, account_id, username, and password (for AWS, Azure, or GCP account-level operations); or their environment variable or .databrickscfg file field equivalents.
Argument Description Environment variable
host (String) The Databricks host URL for either the Databricks workspace endpoint or the Databricks accounts endpoint. DATABRICKS_HOST
account_id (String) The Databricks account ID for the Databricks accounts endpoint. Only has effect when Host is either https://accounts.cloud.databricks.com/ (AWS), https://accounts.azuredatabricks.net/ (Azure), or https://accounts.gcp.databricks.com/ (GCP). DATABRICKS_ACCOUNT_ID
token (String) The Databricks personal access token (PAT) (AWS, Azure, and GCP) or Azure Active Directory (Azure AD) token (Azure). DATABRICKS_TOKEN
username (String) The Databricks username part of basic authentication. Only possible when Host is *.cloud.databricks.com (AWS). DATABRICKS_USERNAME
password (String) The Databricks password part of basic authentication. Only possible when Host is *.cloud.databricks.com (AWS). DATABRICKS_PASSWORD

For example, to use Databricks token authentication:

from databricks.sdk import WorkspaceClient
w = WorkspaceClient(host=input('Databricks Workspace URL: '), token=input('Token: '))

Azure native authentication

By default, the Databricks SDK for Python first tries Azure client secret authentication (auth_type='azure-client-secret' argument). If the SDK is unsuccessful, it then tries Azure CLI authentication (auth_type='azure-cli' argument). See Manage service principals.

The Databricks SDK for Python picks up an Azure CLI token, if you've previously authenticated as an Azure user by running az login on your machine. See Get Azure AD tokens for users by using the Azure CLI.

To authenticate as an Azure Active Directory (Azure AD) service principal, you must provide one of the following. See also Add a service principal to your Azure Databricks account:

  • azure_resource_id, azure_client_secret, azure_client_id, and azure_tenant_id; or their environment variable or .databrickscfg file field equivalents.
  • azure_resource_id and azure_use_msi; or their environment variable or .databrickscfg file field equivalents.
Argument Description Environment variable
azure_resource_id (String) The Azure Resource Manager ID for the Azure Databricks workspace, which is exchanged for a Databricks host URL. DATABRICKS_AZURE_RESOURCE_ID
azure_use_msi (Boolean) true to use Azure Managed Service Identity passwordless authentication flow for service principals. This feature is not yet implemented in the Databricks SDK for Python. ARM_USE_MSI
azure_client_secret (String) The Azure AD service principal's client secret. ARM_CLIENT_SECRET
azure_client_id (String) The Azure AD service principal's application ID. ARM_CLIENT_ID
azure_tenant_id (String) The Azure AD service principal's tenant ID. ARM_TENANT_ID
azure_environment (String) The Azure environment type (such as Public, UsGov, China, and Germany) for a specific set of API endpoints. Defaults to PUBLIC. ARM_ENVIRONMENT

For example, to use Azure client secret authentication:

from databricks.sdk import WorkspaceClient
w = WorkspaceClient(host=input('Databricks Workspace URL: '),
                    azure_workspace_resource_id=input('Azure Resource ID: '),
                    azure_tenant_id=input('AAD Tenant ID: '),
                    azure_client_id=input('AAD Client ID: '),
                    azure_client_secret=input('AAD Client Secret: '))

Please see more examples in this document.

Overriding .databrickscfg

For Databricks native authentication, you can override the default behavior for using .databrickscfg as follows:

Argument Description Environment variable
profile (String) A connection profile specified within .databrickscfg to use instead of DEFAULT. DATABRICKS_CONFIG_PROFILE
config_file (String) A non-default location of the Databricks CLI credentials file. DATABRICKS_CONFIG_FILE

For example, to use a profile named MYPROFILE instead of DEFAULT:

from databricks.sdk import WorkspaceClient
w = WorkspaceClient(profile='MYPROFILE')
# Now call the Databricks workspace APIs as desired...

Additional authentication configuration options

For all authentication methods, you can override the default behavior in client arguments as follows:

Argument Description Environment variable
auth_type (String) When multiple auth attributes are available in the environment, use the auth type specified by this argument. This argument also holds the currently selected auth. DATABRICKS_AUTH_TYPE
http_timeout_seconds (Integer) Number of seconds for HTTP timeout. Default is 60. (None)
retry_timeout_seconds (Integer) Number of seconds to keep retrying HTTP requests. Default is 300 (5 minutes). (None)
debug_truncate_bytes (Integer) Truncate JSON fields in debug logs above this limit. Default is 96. DATABRICKS_DEBUG_TRUNCATE_BYTES
debug_headers (Boolean) true to debug HTTP headers of requests made by the application. Default is false, as headers contain sensitive data, such as access tokens. DATABRICKS_DEBUG_HEADERS
rate_limit (Integer) Maximum number of requests per second made to Databricks REST API. DATABRICKS_RATE_LIMIT

For example, to turn on debug HTTP headers:

from databricks.sdk import WorkspaceClient
w = WorkspaceClient(debug_headers=True)
# Now call the Databricks workspace APIs as desired...

Code examples

To find code examples that demonstrate how to call the Databricks SDK for Python, see the top-level examples folder within this repository

Long-running operations

When you invoke a long-running operation, the SDK provides a high-level API to trigger these operations and wait for the related entities to reach the correct state or return the error message in case of failure. All long-running operations return generic Wait instance with result() method to get a result of long-running operation, once it's finished. Databricks SDK for Python picks the most reasonable default timeouts for every method, but sometimes you may find yourself in a situation, where you'd want to provide datetime.timedelta() as the value of timeout argument to result() method.

There are a number of long-running operations in Databricks APIs such as managing:

  • Clusters,
  • Command execution
  • Jobs
  • Libraries
  • Delta Live Tables pipelines
  • Databricks SQL warehouses.

For example, in the Clusters API, once you create a cluster, you receive a cluster ID, and the cluster is in the PENDING state Meanwhile Databricks takes care of provisioning virtual machines from the cloud provider in the background. The cluster is only usable in the RUNNING state and so you have to wait for that state to be reached.

Another example is the API for running a job or repairing the run: right after the run starts, the run is in the PENDING state. The job is only considered to be finished when it is in either the TERMINATED or SKIPPED state. Also you would likely need the error message if the long-running operation times out and fails with an error code. Other times you may want to configure a custom timeout other than the default of 20 minutes.

In the following example, w.clusters.create returns ClusterInfo only once the cluster is in the RUNNING state, otherwise it will timeout in 10 minutes:

import datetime
import logging
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
info = w.clusters.create_and_wait(cluster_name='Created cluster',
                                  spark_version='12.0.x-scala2.12',
                                  node_type_id='m5d.large',
                                  autotermination_minutes=10,
                                  num_workers=1,
                                  timeout=datetime.timedelta(minutes=10))
logging.info(f'Created: {info}')

Please look at the examples/starting_job_and_waiting.py for a more advanced usage:

import datetime
import logging
import time

from databricks.sdk import WorkspaceClient
import databricks.sdk.service.jobs as j

w = WorkspaceClient()

# create a dummy file on DBFS that just sleeps for 10 seconds
py_on_dbfs = f'/home/{w.current_user.me().user_name}/sample.py'
with w.dbfs.open(py_on_dbfs, write=True, overwrite=True) as f:
    f.write(b'import time; time.sleep(10); print("Hello, World!")')

# trigger one-time-run job and get waiter object
waiter = w.jobs.submit(run_name=f'py-sdk-run-{time.time()}', tasks=[
    j.RunSubmitTaskSettings(
        task_key='hello_world',
        new_cluster=j.BaseClusterInfo(
            spark_version=w.clusters.select_spark_version(long_term_support=True),
            node_type_id=w.clusters.select_node_type(local_disk=True),
            num_workers=1
        ),
        spark_python_task=j.SparkPythonTask(
            python_file=f'dbfs:{py_on_dbfs}'
        ),
    )
])

logging.info(f'starting to poll: {waiter.run_id}')

# callback, that receives a polled entity between state updates
def print_status(run: j.Run):
    statuses = [f'{t.task_key}: {t.state.life_cycle_state}' for t in run.tasks]
    logging.info(f'workflow intermediate status: {", ".join(statuses)}')

# If you want to perform polling in a separate thread, process, or service,
# you can use w.jobs.wait_get_run_job_terminated_or_skipped(
#   run_id=waiter.run_id,
#   timeout=datetime.timedelta(minutes=15),
#   callback=print_status) to achieve the same results.
#
# Waiter interface allows for `w.jobs.submit(..).result()` simplicity in
# the scenarios, where you need to block the calling thread for the job to finish.
run = waiter.result(timeout=datetime.timedelta(minutes=15),
                    callback=print_status)

logging.info(f'job finished: {run.run_page_url}')

Paginated responses

On the platform side the Databricks APIs have different wait to deal with pagination:

  • Some APIs follow the offset-plus-limit pagination
  • Some start their offsets from 0 and some from 1
  • Some use the cursor-based iteration
  • Others just return all results in a single response

The Databricks SDK for Python hides this complexity under Iterator[T] abstraction, where multi-page results yield items. Python typing helps to auto-complete the individual item fields.

import logging
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
for repo in w.repos.list():
    logging.info(f'Found repo: {repo.path}')

Please look at the examples/last_job_runs.py for a more advanced usage:

import logging
from collections import defaultdict
from datetime import datetime, timezone
from databricks.sdk import WorkspaceClient

latest_state = {}
all_jobs = {}
durations = defaultdict(list)

w = WorkspaceClient()
for job in w.jobs.list():
    all_jobs[job.job_id] = job
    for run in w.jobs.list_runs(job_id=job.job_id, expand_tasks=False):
        durations[job.job_id].append(run.run_duration)
        if job.job_id not in latest_state:
            latest_state[job.job_id] = run
            continue
        if run.end_time < latest_state[job.job_id].end_time:
            continue
        latest_state[job.job_id] = run

summary = []
for job_id, run in latest_state.items():
    summary.append({
        'job_name': all_jobs[job_id].settings.name,
        'last_status': run.state.result_state,
        'last_finished': datetime.fromtimestamp(run.end_time/1000, timezone.utc),
        'average_duration': sum(durations[job_id]) / len(durations[job_id])
    })

for line in sorted(summary, key=lambda s: s['last_finished'], reverse=True):
    logging.info(f'Latest: {line}')

Single-Sign-On (SSO) with OAuth

Authorization Code flow with PKCE

For a regular web app running on a server, it's recommended to use the Authorization Code Flow to obtain an Access Token and a Refresh Token. This method is considered safe because the Access Token is transmitted directly to the server hosting the app, without passing through the user's web browser and risking exposure.

To enhance the security of the Authorization Code Flow, the PKCE (Proof Key for Code Exchange) mechanism can be employed. With PKCE, the calling application generates a secret called the Code Verifier, which is verified by the authorization server. The app also creates a transform value of the Code Verifier, called the Code Challenge, and sends it over HTTPS to obtain an Authorization Code. By intercepting the Authorization Code, a malicious attacker cannot exchange it for a token without possessing the Code Verifier.

The presented sample is a Python3 script that uses the Flask web framework along with Databricks SDK for Python to demonstrate how to implement the OAuth Authorization Code flow with PKCE security. It can be used to build an app where each user uses their identity to access Databricks resources. The script can be executed with or without client and secret credentials for a custom OAuth app.

Databricks SDK for Python exposes the oauth_client.initiate_consent() helper to acquire user redirect URL and initiate PKCE state verification. Application developers are expected to persist RefreshableCredentials in the webapp session and restore it via RefreshableCredentials.from_dict(oauth_client, session['creds']) helpers.

Works for both AWS and Azure. Not supported for GCP at the moment.

from databricks.sdk.oauth import OAuthClient

oauth_client = OAuthClient(host='<workspace-url>',
                           client_id='<oauth client ID>',
                           redirect_url=f'http://host.domain/callback',
                           scopes=['clusters'])

import secrets
from flask import Flask, render_template_string, request, redirect, url_for, session

APP_NAME = 'flask-demo'
app = Flask(APP_NAME)
app.secret_key = secrets.token_urlsafe(32)


@app.route('/callback')
def callback():
   from databricks.sdk.oauth import Consent
   consent = Consent.from_dict(oauth_client, session['consent'])
   session['creds'] = consent.exchange_callback_parameters(request.args).as_dict()
   return redirect(url_for('index'))


@app.route('/')
def index():
   if 'creds' not in session:
      consent = oauth_client.initiate_consent()
      session['consent'] = consent.as_dict()
      return redirect(consent.auth_url)

   from databricks.sdk import WorkspaceClient
   from databricks.sdk.oauth import SessionCredentials

   credentials_provider = SessionCredentials.from_dict(oauth_client, session['creds'])
   workspace_client = WorkspaceClient(host=oauth_client.host,
                                      product=APP_NAME,
                                      credentials_provider=credentials_provider)

   return render_template_string('...', w=workspace_client)

SSO for local scripts on development machines

For applications, that do run on developer workstations, Databricks SDK for Python provides auth_type='external-browser' utility, that opens up a browser for a user to go through SSO flow. Azure support is still in the early experimental stage.

from databricks.sdk import WorkspaceClient

host = input('Enter Databricks host: ')

w = WorkspaceClient(host=host, auth_type='external-browser')
clusters = w.clusters.list()

for cl in clusters:
    print(f' - {cl.cluster_name} is {cl.state}')

Creating custom OAuth applications

In order to use OAuth with Databricks SDK for Python, you should use account_client.custom_app_integration.create API.

import logging, getpass
from databricks.sdk import AccountClient
account_client = AccountClient(host='https://accounts.cloud.databricks.com',
                               account_id=input('Databricks Account ID: '),
                               username=input('Username: '),
                               password=getpass.getpass('Password: '))

logging.info('Enrolling all published apps...')
account_client.o_auth_enrollment.create(enable_all_published_apps=True)

status = account_client.o_auth_enrollment.get()
logging.info(f'Enrolled all published apps: {status}')

custom_app = account_client.custom_app_integration.create(
    name='awesome-app',
    redirect_urls=[f'https://host.domain/path/to/callback'],
    confidential=True)
logging.info(f'Created new custom app: '
             f'--client_id {custom_app.client_id} '
             f'--client_secret {custom_app.client_secret}')

Logging

The Databricks SDK for Python seamlessly integrates with the standard Logging facility for Python. This allows developers to easily enable and customize logging for their Databricks Python projects. To enable debug logging in your Databricks Python project, you can follow the example below:

import logging, sys
logging.basicConfig(stream=sys.stderr,
                    level=logging.INFO,
                    format='%(asctime)s [%(name)s][%(levelname)s] %(message)s')
logging.getLogger('databricks.sdk').setLevel(logging.DEBUG)

from databricks.sdk import WorkspaceClient
w = WorkspaceClient(debug_truncate_bytes=1024, debug_headers=False)
for cluster in w.clusters.list():
    logging.info(f'Found cluster: {cluster.cluster_name}')

In the above code snippet, the logging module is imported and the basicConfig() method is used to set the logging level to DEBUG. This will enable logging at the debug level and above. Developers can adjust the logging level as needed to control the verbosity of the logging output. The SDK will log all requests and responses to standard error output, using the format > for requests and < for responses. In some cases, requests or responses may be truncated due to size considerations. If this occurs, the log message will include the text ... (XXX additional elements) to indicate that the request or response has been truncated. To increase the truncation limits, developers can set the debug_truncate_bytes configuration property or the DATABRICKS_DEBUG_TRUNCATE_BYTES environment variable. To protect sensitive data, such as authentication tokens, passwords, or any HTTP headers, the SDK will automatically replace these values with **REDACTED** in the log output. Developers can disable this redaction by setting the debug_headers configuration property to True.

2023-03-22 21:19:21,702 [databricks.sdk][DEBUG] GET /api/2.0/clusters/list
< 200 OK
< {
<   "clusters": [
<     {
<       "autotermination_minutes": 60,
<       "cluster_id": "1109-115255-s1w13zjj",
<       "cluster_name": "DEFAULT Test Cluster",
<       ... truncated for brevity
<     },
<     "... (47 additional elements)"
<   ]
< }

Overall, the logging capabilities provided by the Databricks SDK for Python can be a powerful tool for monitoring and troubleshooting your Databricks Python projects. Developers can use the various logging methods and configuration options provided by the SDK to customize the logging output to their specific needs.

Interaction with dbutils

You can use the client-side implementation of dbutils by accessing dbutils property on the WorkspaceClient. Most of the dbutils.fs operations and dbutils.secrets are implemented natively in Python within Databricks SDK. Non-SDK implementations still require a Databricks cluster, that you have to specify through the cluster_id configuration attribute or DATABRICKS_CLUSTER_ID environment variable. Don't worry if cluster is not running: internally, Databricks SDK for Python calls w.clusters.ensure_cluster_is_running().

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
dbutils = w.dbutils

files_in_root = dbutils.fs.ls('/')
print(f'number of files in root: {len(files_in_root)}')

Alternatively, you can import dbutils from databricks.sdk.runtime module, but you have to make sure that all configuration is already present in the environment variables:

from databricks.sdk.runtime import dbutils

for secret_scope in dbutils.secrets.listScopes():
    for secret_metadata in dbutils.secrets.list(secret_scope.name):
        print(f'found {secret_metadata.key} secret in {secret_scope.name} scope')

Interface stability

Databricks is actively working on stabilizing the Databricks SDK for Python's interfaces. API clients for all services are generated from specification files that are synchronized from the main platform. You are highly encouraged to pin the exact dependency version and read the changelog where Databricks documents the changes. Databricks may have minor documented backward-incompatible changes, such as renaming some type names to bring more consistency.

More Repositories

1

learning-spark

Example code from Learning Spark book
Java
3,864
star
2

koalas

Koalas: pandas API on Apache Spark
Python
3,330
star
3

Spark-The-Definitive-Guide

Spark: The Definitive Guide's Code Repository
Scala
2,678
star
4

scala-style-guide

Databricks Scala Coding Style Guide
2,673
star
5

spark-deep-learning

Deep Learning Pipelines for Apache Spark
Python
1,993
star
6

click

The "Command Line Interactive Controller for Kubernetes"
Rust
1,416
star
7

LearningSparkV2

This is the github repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
Scala
1,158
star
8

megablocks

Python
1,147
star
9

spark-sklearn

(Deprecated) Scikit-learn integration package for Apache Spark
Python
1,080
star
10

spark-csv

CSV Data Source for Apache Spark 1.x
Scala
1,051
star
11

tensorframes

[DEPRECATED] Tensorflow wrapper for DataFrames on Apache Spark
Scala
749
star
12

devrel

This repository contains the notebooks and presentations we use for our Databricks Tech Talks
HTML
687
star
13

reference-apps

Spark reference applications
Scala
648
star
14

spark-redshift

Redshift data source for Apache Spark
Scala
598
star
15

spark-sql-perf

Scala
543
star
16

spark-avro

Avro Data Source for Apache Spark
Scala
538
star
17

spark-xml

XML data source for Spark SQL and DataFrames
Scala
501
star
18

spark-corenlp

Stanford CoreNLP wrapper for Apache Spark
Scala
424
star
19

mlops-stacks

This repo provides a customizable stack for starting new ML projects on Databricks that follow production best-practices out of the box.
Python
415
star
20

spark-training

Apache Spark training material
Scala
396
star
21

databricks-cli

(Legacy) Command Line Interface for Databricks
Python
384
star
22

spark-perf

Performance tests for Apache Spark
Scala
372
star
23

delta-live-tables-notebooks

Python
334
star
24

terraform-provider-databricks

Databricks Terraform Provider
Go
333
star
25

spark-knowledgebase

Spark Knowledge Base
328
star
26

databricks-ml-examples

Python
284
star
27

sjsonnet

Scala
266
star
28

jsonnet-style-guide

Databricks Jsonnet Coding Style Guide
205
star
29

dbt-databricks

A dbt adapter for Databricks.
Python
199
star
30

containers

Sample base images for Databricks Container Services
Dockerfile
163
star
31

databricks-sql-python

Databricks SQL Connector for Python
Python
158
star
32

sbt-spark-package

Sbt plugin for Spark packages
Scala
150
star
33

notebook-best-practices

An example showing how to apply software engineering best practices to Databricks notebooks.
Python
116
star
34

databricks-vscode

VS Code extension for Databricks
TypeScript
114
star
35

benchmarks

A place in which we publish scripts for reproducible benchmarks.
Python
106
star
36

terraform-databricks-examples

Examples of using Terraform to deploy Databricks resources
HCL
103
star
37

spark-tfocs

A Spark port of TFOCS: Templates for First-Order Conic Solvers (cvxr.com/tfocs)
Scala
88
star
38

intellij-jsonnet

Intellij Jsonnet Plugin
Java
82
star
39

sbt-databricks

An sbt plugin for deploying code to Databricks Cloud
Scala
71
star
40

terraform-databricks-lakehouse-blueprints

Set of Terraform automation templates and quickstart demos to jumpstart the design of a Lakehouse on Databricks. This project has incorporated best practices across the industries we work with to deliver composable modules to build a workspace to comply with the highest platform security and governance standards.
Python
71
star
41

spark-integration-tests

Integration tests for Spark
Scala
68
star
42

genai-cookbook

Python
63
star
43

spark-pr-dashboard

Dashboard to aid in Spark pull request reviews
JavaScript
54
star
44

run-notebook

TypeScript
47
star
45

ide-best-practices

Best practices for working with Databricks from an IDE
Python
47
star
46

unity-catalog-setup

Notebooks, terraform, tools to enable setting up Unity Catalog
45
star
47

simr

Spark In MapReduce (SIMR) - launching Spark applications on existing Hadoop MapReduce infrastructure
Java
44
star
48

devbox

Scala
38
star
49

databricks-sql-go

Golang database/sql driver for Databricks SQL.
Go
35
star
50

diviner

Grouped time series forecasting engine
Python
33
star
51

cli

Databricks CLI
Go
32
star
52

tmm

Python
30
star
53

security-bucket-brigade

JavaScript
30
star
54

databricks-sdk-go

Databricks SDK for Go
Go
29
star
55

pig-on-spark

proof-of-concept implementation of Pig-on-Spark integrated at the logical node level
Scala
28
star
56

databricks-sql-cli

CLI for querying Databricks SQL
Python
27
star
57

automl

Python
26
star
58

databricks-sql-nodejs

Databricks SQL Connector for Node.js
TypeScript
24
star
59

tpch-dbgen

Patched version of dbgen
C
22
star
60

als-benchmark-scripts

Scripts to benchmark distributed Alternative Least Squares (ALS)
Scala
22
star
61

spark-package-cmd-tool

A command line tool for Spark packages
Python
18
star
62

congruity

The goal of this library is to provide a compatibility layer that makes it easier to adopt Spark Connect. The library is designed to be simply imported in your application and will then monkey-patch the existing API to provide the legacy functionality.
Python
16
star
63

python-interview

Databricks Python interview setup instructions
15
star
64

xgb-regressor

MLflow XGBoost Regressor
Python
15
star
65

databricks-accelerators

Accelerate the use of Databricks for customers [public repo]
Python
15
star
66

tableau-connector

Scala
12
star
67

files_in_repos

Python
12
star
68

upload-dbfs-temp

TypeScript
12
star
69

spark-sklearn-docs

HTML
11
star
70

sqltools-databricks-driver

SQLTools driver for Databricks SQL
TypeScript
11
star
71

genomics-pipelines

secondary analysis pipelines parallelized with apache spark
Scala
10
star
72

workflows-examples

10
star
73

databricks-sdk-java

Databricks SDK for Java
Java
10
star
74

dais-cow-bff

Code for the "Path to Production" DAIS 2024 and 2023 talks
Jupyter Notebook
8
star
75

xgboost-linux64

Databricks Private xgboost Linux64 fork
C++
8
star
76

mlflow-example-sklearn-elasticnet-wine

Jupyter Notebook
7
star
77

databricks-ttyd

C
6
star
78

setup-cli

Sets up the Databricks CLI in your GitHub Actions workflow.
Shell
4
star
79

terraform-databricks-mlops-aws-project

This module creates and configures service principals with appropriate permissions and entitlements to run CI/CD for a project, and creates a workspace directory as a container for project-specific resources for the Databricks AWS staging and prod workspaces.
HCL
4
star
80

jenkins-job-builder

Fork of https://docs.openstack.org/infra/jenkins-job-builder/ to include unmerged patches
Python
4
star
81

terraform-databricks-mlops-azure-project-with-sp-creation

This module creates and configures service principals with appropriate permissions and entitlements to run CI/CD for a project, and creates a workspace directory as a container for project-specific resources for the Azure Databricks staging and prod workspaces. It also creates the relevant Azure Active Directory (AAD) applications for the service principals.
HCL
4
star
82

terraform-databricks-sra

The Security Reference Architecture (SRA) implements typical security features as Terraform Templates that are deployed by most high-security organizations, and enforces controls for the largest risks that customers ask about most often.
HCL
4
star
83

databricks-empty-ide-project

Empty IDE project used by the VSCode extension for Databricks
3
star
84

databricks-repos-proxy

Python
2
star
85

databricks-asset-bundles-dais2023

Python
2
star
86

pex

Fork of pantsbuild/pex with a few Databricks-specific changes
Python
2
star
87

SnpEff

Databricks snpeff fork
Java
2
star
88

notebook_gallery

Jupyter Notebook
2
star
89

terraform-databricks-mlops-aws-infrastructure

This module sets up multi-workspace model registry between a Databricks AWS development (dev) workspace, staging workspace, and production (prod) workspace, allowing READ access from dev/staging workspaces to staging & prod model registries.
HCL
2
star
90

expectations

Python
1
star
91

homebrew-tap

Homebrew Tap for the Databricks CLI
Ruby
1
star
92

terraform-databricks-mlops-azure-infrastructure-with-sp-creation

This module sets up multi-workspace model registry between an Azure Databricks development (dev) workspace, staging workspace, and production (prod) workspace, allowing READ access from dev/staging workspaces to staging & prod model registries. It also creates the relevant Azure Active Directory (AAD) applications for the service principals.
HCL
1
star
93

mfg_dlt_workshop

DLT Manufacturing Workshop
Python
1
star
94

databricks-dbutils-scala

The Scala SDK for Databricks.
Scala
1
star
95

kdd24-forecasting-anomaly-detection

Python
1
star
96

terraform-databricks-mlops-azure-project-with-sp-linking

This module creates and configures service principals with appropriate permissions and entitlements to run CI/CD for a project, and creates a workspace directory as a container for project-specific resources for the Azure Databricks staging and prod workspaces. It also links pre-existing Azure Active Directory (AAD) applications to the service principals.
HCL
1
star
97

terraform-databricks-mlops-azure-infrastructure-with-sp-linking

This module sets up multi-workspace model registry between an Azure Databricks development (dev) workspace, staging workspace, and production (prod) workspace, allowing READ access from dev/staging workspaces to staging & prod model registries. It also links pre-existing Azure Active Directory (AAD) applications to the service principals.
HCL
1
star