• Stars: 220
• Rank: 180,422 (Top 4%)
• Language: Python
• License: Other
• Created: over 1 year ago
• Updated: about 1 month ago


Repository Details

Automated migrations to Unity Catalog

UCX by Databricks Labs

Your best companion for upgrading to Unity Catalog. It helps you upgrade all Databricks workspace assets: Legacy Table ACLs, Entitlements, AWS instance profiles, Clusters, Cluster policies, Instance Pools, Databricks SQL warehouses, Delta Live Tables, Jobs, MLflow experiments, MLflow registry, SQL Dashboards & Queries, SQL Alerts, Token and Password usage permissions set at the workspace level, Secret scopes, Notebooks, Directories, Repos, and Files.


See contributing instructions to help improve this project.

Introduction

UCX will guide you, the Databricks customer, through the process of upgrading your account, groups, workspaces, jobs, and other assets to Unity Catalog.

  1. The upgrade process will first install code, libraries, and workflows into your workspace.
  2. After installation, you will run a series of workflows and examine the output.

UCX leverages the Databricks Lakehouse platform for the upgrade itself. The upgrade process includes creating jobs and notebooks and deploying code and configuration files.

Running the installer deploys the assessment job and several upgrade jobs. The assessment and upgrade jobs are outlined in the custom-generated README.py created by the installer.

The custom-generated README.py, config.yaml, and other assets are placed in a subfolder named .ucx inside your Databricks workspace home folder. See the interactive tutorial.
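
To confirm the assets landed where expected, you can list the .ucx folder with the Databricks CLI. A minimal sketch; the user name below is a placeholder for your own workspace user:

    # List the UCX assets installed in your workspace home folder
    databricks workspace list /Users/<your-user>/.ucx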

Once the custom Databricks jobs are installed, begin by triggering the assessment job. You can find it under your workflows or via the active link in the README.py. When the assessment job completes, review the results in the custom-generated Databricks dashboard (linked from the custom README.py in the workspace folder created for you).
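
If you prefer the CLI over the workspace UI, a minimal sketch for finding and triggering the assessment job; the name filter is an assumption, so check the exact job name in your README.py:

    # Locate the assessment job's ID (the job name may differ per installation)
    databricks jobs list | grep -i assessment
    # Trigger the job by its ID
    databricks jobs run-now <job-id>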

You will need account, Unity Catalog, and workspace administrator privileges to complete the upgrade process. To run the installer, you will need to set up the Databricks CLI and a credential, following these instructions. Additionally, the interim metadata and configuration data processed by UCX will be stored in a Hive Metastore database schema generated at install time.
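
A minimal sketch of the credential setup, assuming OAuth user-to-machine login; replace the host with your own workspace URL:

    # Authenticate the Databricks CLI against your workspace
    databricks auth login --host https://<your-workspace>.cloud.databricks.com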

For questions, troubleshooting, or bug fixes, please contact your Databricks account team or submit an issue to the Databricks UCX GitHub repo.

Installation

Prerequisites

  1. Get trained on UC [free instructor-led training 2x week] [full training schedule]
  2. You will need a desktop computer running Windows, macOS, or Linux. This computer is used to install the UCX toolkit onto the Databricks workspace and will also need the following (a quick verification sketch follows this list):
  • Network access to your Databricks Workspace
  • Network access to the Internet to retrieve additional Python packages (e.g. PyYAML, databricks-sdk, ...) and to access github.com
  • Python 3.10 or later - Windows instructions
  • Databricks CLI with a workspace configuration profile for the workspace - instructions
  • On Windows, a shell environment (Git Bash or WSL)
  3. Within the Databricks Workspace you will need:
  • Workspace administrator access permissions
  • The ability for the installer to upload Python wheel files to DBFS and the Workspace FileSystem
  • A PRO or Serverless SQL Warehouse
  • The Assessment workflow will create legacy "No Isolation Shared" and legacy "Table ACL" job clusters, which are needed to inventory Hive Metastore table ACLs
  • If your Databricks Workspace relies on an external Hive Metastore (such as AWS Glue), make sure to read the External HMS document.
  4. [AWS] [Azure] [GCP] Account level Identity Setup
  5. [AWS] [Azure] [GCP] Unity Catalog Metastore Created (per region)
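
A quick sketch for verifying the client-side prerequisites above; the exact output will vary by environment:

    # Python 3.10 or later is required
    python3 --version
    # The Databricks CLI must be installed and on your PATH
    databricks --version
    # Confirm the configured profile can reach your workspace
    databricks current-user me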

Download & Install

We only support installations and upgrades through the Databricks CLI, as UCX requires an installation script to run to ensure all the necessary and correct configurations are in place.

Installing Databricks CLI on macOS

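A minimal sketch using Homebrew, per the Databricks CLI documentation:

    # Add the Databricks tap and install the CLI
    brew tap databricks/tap
    brew install databricks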

Install Databricks CLI via curl on Windows

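A minimal sketch, assuming a shell environment such as Git Bash or WSL and the install script published in the databricks/setup-cli repository:

    # Download and run the Databricks CLI setup script
    curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh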

Install UCX

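With the CLI authenticated against your workspace, UCX installs as a Databricks Labs project:

    # Install UCX into the workspace selected by your CLI profile
    databricks labs install ucx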

Upgrade UCX

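Upgrades go through the same Labs mechanism; a sketch, assuming a recent CLI version:

    # Upgrade an existing UCX installation in place
    databricks labs upgrade ucx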

Uninstall UCX

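Removal works the same way; a sketch:

    # Remove UCX from the workspace
    databricks labs uninstall ucx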


Project Support

Please note that all projects in the /databrickslabs GitHub account are provided for your exploration only and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.

More Repositories

1. dolly (Python, 10,811 stars) - Databricks' Dolly, a large language model trained on the Databricks Machine Learning Platform
2. pyspark-ai (Python, 739 stars) - English SDK for Apache Spark
3. dbx (Python, 440 stars) - 🧱 Databricks CLI eXtensions - aka dbx is a CLI tool for development and advanced Databricks workflows management
4. dbldatagen (Python, 313 stars) - Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for tests, POCs, and other uses in Databricks environments, including in Delta Live Tables pipelines
5. tempo (Jupyter Notebook, 306 stars) - API for manipulating time series on top of Apache Spark: lagged time values, rolling statistics (mean, avg, sum, count, etc.), AS OF joins, downsampling, and interpolation
6. mosaic (Jupyter Notebook, 270 stars) - An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets
7. overwatch (Scala, 226 stars) - Capture deep metrics on one or all assets within a Databricks workspace
8. cicd-templates (Python, 202 stars) - Manage your Databricks deployments and CI with code
9. automl-toolkit (HTML, 191 stars) - Toolkit for Apache Spark ML: feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, model interpretability
10. migrate (Python, 186 stars) - Old scripts for one-off ST-to-E2 migrations. Use the "terraform exporter" linked in the readme
11. dlt-meta (Python, 147 stars) - Metadata-driven Databricks Delta Live Tables framework for bronze/silver pipelines
12. dataframe-rules-engine (Scala, 134 stars) - Extensible rules engine for custom DataFrame / Dataset validation
13. discoverx (Python, 105 stars) - A Swiss-Army-knife for your Data Intelligence platform administration
14. geoscan (Scala, 94 stars) - Geospatial clustering at massive scale
15. jupyterlab-integration (HTML, 71 stars) - DEPRECATED: Integrating Jupyter with Databricks via SSH
16. smolder (Scala, 61 stars) - HL7 Apache Spark Datasource
17. feature-factory (Python, 55 stars) - Accelerator to rapidly deploy customized features for your business
18. databricks-sync (Python, 46 stars) - An experimental tool to synchronize a source Databricks deployment with a target Databricks deployment
19. doc-qa (Python, 45 stars)
20. transpiler (Scala, 42 stars) - SIEM-to-Spark Transpiler
21. brickster (R, 40 stars) - R Toolkit for Databricks
22. delta-oms (Scala, 38 stars) - DeltaOMS helps build a centralized repository of Delta transaction logs and associated operational metrics/statistics for your Delta Lakehouse. Unity Catalog is supported in the v0.7.0-rc1 release. Documentation: https://databrickslabs.github.io/delta-oms/v0.7.0-rc1/
23. pytester (Python, 35 stars) - Python Testing for Databricks
24. remorph (Scala, 33 stars) - Cross-compiler and Data Reconciler into Databricks Lakehouse
25. splunk-integration (Python, 26 stars) - Databricks Add-on for Splunk
26. dbignite (Python, 24 stars)
27. arcuate (Python, 22 stars) - Delta Sharing + MLflow for ML model & experiment exchange (arcuate delta - a fan-shaped river delta)
28. databricks-sdk-r (R, 19 stars) - Databricks SDK for R (Experimental)
29. tika-ocr (Rich Text Format, 17 stars)
30. sandbox (Go, 16 stars) - Experimental or low-maturity things
31. blueprint (Python, 16 stars) - Baseline for Databricks Labs projects written in Python
32. delta-sharing-java-connector (Java, 13 stars) - A Java connector for delta.io/sharing/ that allows you to easily ingest data on any JVM
33. partner-connect-api (Scala, 12 stars)
34. pylint-plugin (Python, 10 stars) - Databricks Plugin for PyLint
35. lsql (Python, 9 stars) - Lightweight SQL execution wrapper on top of the Databricks SDK
36. waterbear (Python, 8 stars) - Automated provisioning of an industry Lakehouse with enterprise data model