
tech-talks

This repository contains the notebooks and presentations we use for our Databricks Tech Talks.

You can find links to the tech talks below as well as the notebooks for these sessions directly in the repo.

Sections

Upcoming-Tech-Talks

2020-04-29 - Workshop | Introduction to Data Analysis for Aspiring Data Scientists: Introduction to Apache Spark

This workshop covers the fundamentals of Apache Spark, the most popular big data processing engine. In this workshop, you will learn how to ingest data with Spark, analyze the Spark UI, and gain a better understanding of distributed computing. We will be using data released by the Johns Hopkins Center for Systems Science and Engineering (CSSE) Novel Coronavirus (COVID-19). Prior basic Python experience is recommended.


2020-04-30 - Using Delta as a Change Data Capture Source

While it is common to use Delta Lake as a sink for change data captured from traditional data sources, customers are increasingly asking how to use Delta tables as a source for a change data capture (CDC) process. Put differently: how can we read a stream of changes from a Delta table so that they can be propagated downstream? In each of these cases, we want to capture a change stream from a Delta table and send it somewhere for further processing. In this session, we will discuss the architecture, use cases, and solutions.
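
One way to picture what such a change stream contains is to compare two snapshots of a table, keyed by a primary key, and emit one event per inserted, updated, or deleted row. The sketch below is plain Python and does not use any Delta Lake API; the function name `diff_snapshots`, the `id` key column, and the event layout are assumptions made purely for illustration.

```python
from typing import Dict, Iterator, List, Tuple

Row = Dict[str, object]


def diff_snapshots(old: List[Row], new: List[Row], key: str = "id") -> Iterator[Tuple[str, Row]]:
    """Yield (change_type, row) pairs describing how `old` became `new`."""
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}

    # Rows only in the new snapshot are inserts; rows whose content differs are updates.
    for k, row in new_by_key.items():
        if k not in old_by_key:
            yield ("insert", row)
        elif row != old_by_key[k]:
            yield ("update", row)

    # Rows that disappeared from the new snapshot are deletes.
    for k, row in old_by_key.items():
        if k not in new_by_key:
            yield ("delete", row)


# Hypothetical table contents at two consecutive versions.
v0 = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
v1 = [{"id": 1, "status": "closed"}, {"id": 3, "status": "open"}]
changes = list(diff_snapshots(v0, v1))
```

A downstream consumer could then apply each `(change_type, row)` pair to its own copy of the data. Comparing full snapshots is the simplest possible approach and scales poorly; it is shown here only to make the shape of a change stream concrete.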


Featured

Notebook | Johns Hopkins CSSE COVID-19 Analysis

This notebook processes and performs a quick analysis of the 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE (https://github.com/CSSEGISandData/COVID-19). The data is updated regularly in the `/databricks-datasets/COVID/CSSEGISandData/` location, so you can access it directly. The following animated GIF shows the COVID-19 confirmed cases and deaths per 100K people per the Johns Hopkins CSSE dataset, spanning March 22nd to April 14th, 2020.
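
The "per 100K people" figure is a simple normalization of a raw count by population. A minimal sketch of that arithmetic follows; the function name `per_100k` and the sample numbers are hypothetical and are not taken from the dataset or the notebook.

```python
def per_100k(count: int, population: int) -> float:
    """Normalize a raw count to a rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing to keep the arithmetic exact for typical integer inputs.
    return count * 100_000 / population


# Hypothetical numbers for illustration only.
rate = per_100k(count=250, population=500_000)
```

Normalizing this way lets regions with very different populations be compared on the same scale.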


Notebook | NY Times COVID-19 Analysis

This notebook processes and performs a quick analysis of the NY Times COVID-19 dataset (https://github.com/nytimes/covid-19-data). The data is updated regularly in the `/databricks-datasets/COVID/covid-19-data/` location, so you can access it directly. The following animated GIFs show the COVID-19 confirmed cases and deaths per 100K people from the NY Times dataset, spanning a two-week window around when educational facilities were closed in Washington (3/13) and New York (3/18) states.


Previous-Tech-Talks

2020-04-23 - Predictive Maintenance (PdM) on IoT Data for Early Fault Detection w/ Delta Lake

Predictive Maintenance (PdM) differs from routine or time-based maintenance approaches: it combines various sensor readings with sophisticated analytics on thousands of logged events in near real time, and it promises severalfold improvements in cost savings because tasks are performed only when warranted. The collaborative data and analytics platform from Databricks is a good technology fit for these use cases, providing a single unified platform to ingest the sensor data, perform the necessary transformations and exploration, run ML, and generate valuable insights.

2020-04-22 - Workshop | Introduction to Data Analysis for Aspiring Data Scientists: Machine Learning with scikit-learn

scikit-learn is one of the most popular open-source machine learning libraries among data science practitioners. This workshop will walk through what machine learning is, the different types of machine learning, and how to build a simple machine learning model. This workshop focuses on the techniques of applying and evaluating machine learning methods, rather than the statistical concepts behind them. We will be using data released by the Johns Hopkins Center for Systems Science and Engineering (CSSE) Novel Coronavirus (COVID-19). Prior basic Python experience is recommended.

2020-04-16 - Diving into Delta Lake: DML Internals

In the earlier sessions of the Delta Lake Internals webinar series, we described how the Delta Lake transaction log works. In this session, we will dive deeper into how commits, snapshot isolation, partitions, and files change when performing deletes, updates, merges, and Structured Streaming operations.
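
As a rough mental model of how files change during a delete: data files are not edited in place, so a file containing matching rows is rewritten without them, and the commit records the old file as removed and the new one as added. The sketch below models that copy-on-write idea in plain Python. It is not Delta Lake's implementation; the function `delete_where`, the file names, and the naming scheme for rewritten files are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Files = Dict[str, List[Row]]  # file name -> rows stored in that file


def delete_where(files: Files, predicate: Callable[[Row], bool], version: int) -> Tuple[Files, List[dict]]:
    """Return the new file set plus the actions a commit would record."""
    new_files: Files = {}
    actions: List[dict] = []
    for name, rows in files.items():
        kept = [r for r in rows if not predicate(r)]
        if len(kept) == len(rows):
            # No matching rows: the file is untouched and stays in the snapshot.
            new_files[name] = rows
            continue
        actions.append({"remove": {"path": name}})
        if kept:
            # Surviving rows are written to a brand-new file (hypothetical naming).
            rewritten = f"part-v{version}-{name}"
            new_files[rewritten] = kept
            actions.append({"add": {"path": rewritten}})
    return new_files, actions


files = {
    "a.parquet": [{"id": 1}, {"id": 2}],
    "b.parquet": [{"id": 3}],
}
new_files, actions = delete_where(files, lambda r: r["id"] == 2, version=1)
```

Because the original `files` mapping is never mutated, a reader holding the old file set still sees a consistent view, which is the intuition behind snapshot isolation.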

2020-04-15 - Workshop | Introduction to Data Analysis for Aspiring Data Scientists: Data Analysis with Pandas

This workshop is on pandas, a powerful open-source Python package for data analysis and manipulation. In this workshop, you will learn how to read data, compute summary statistics, check data distributions, conduct basic data cleaning and transformation, and plot simple visualizations. We will be using data released by the Johns Hopkins Center for Systems Science and Engineering (CSSE) Novel Coronavirus (COVID-19). Prior basic Python experience is recommended.

2020-04-08 - Workshop | Introduction to Data Analysis for Aspiring Data Scientists: Introduction to Python on Databricks

Python is a popular programming language because of its wide applications including but not limited to data analysis, machine learning, and web development. This workshop covers major foundational concepts necessary for you to start coding in Python, with a focus on data analysis. You will learn about different types of variables, for loops, functions, and conditional statements. No prior programming knowledge is required.

2020-04-02 - Diving into Delta Lake: Enforcing and Evolving Schema

As business problems and requirements evolve over time, so too does the structure of your data. With Delta Lake, as the data changes, incorporating new dimensions is easy. Users have access to simple semantics to control the schema of their tables. These tools include schema enforcement, which prevents users from accidentally polluting their tables with mistakes or garbage data, as well as schema evolution, which enables them to automatically add new columns of rich data when those columns belong. In this webinar, we’ll dive into the use of these tools.
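
The difference between the two behaviours can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Delta Lake code: the function `write_rows`, its `merge_schema` flag, and the dictionary-based schema are assumptions chosen to mirror the idea of rejecting unexpected columns by default and adding them only when explicitly asked.

```python
from typing import Dict, List

Schema = Dict[str, type]
Row = Dict[str, object]


def write_rows(schema: Schema, rows: List[Row], merge_schema: bool = False) -> Schema:
    """Validate `rows` against `schema`; return the (possibly evolved) schema."""
    result = dict(schema)
    for row in rows:
        for column, value in row.items():
            if column not in result:
                if not merge_schema:
                    # Enforcement: unknown columns are rejected outright.
                    raise ValueError(f"unexpected column: {column}")
                # Evolution: the new column is added to the schema.
                result[column] = type(value)
            elif not isinstance(value, result[column]):
                raise TypeError(f"column {column} expects {result[column].__name__}")
    return result


schema = {"id": int, "country": str}
new_row = {"id": 1, "country": "US", "cases": 10}
evolved = write_rows(schema, [new_row], merge_schema=True)
```

Calling `write_rows(schema, [new_row])` without the flag raises `ValueError`, which is the enforcement path; the original `schema` is left unchanged in both cases.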

2020-03-26 - Diving into Delta Lake: Unpacking the Transaction Log

The transaction log is key to understanding Delta Lake because it is the common thread that runs through many of its most important features, including ACID transactions, scalable metadata handling, time travel, and more. In this session, we’ll explore what the Delta Lake transaction log is, how it works at the file level, and how it offers an elegant solution to the problem of multiple concurrent reads and writes.
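
At the file level, the log can be pictured as an ordered series of commit files, each listing actions such as adding or removing a data file. Replaying them in order yields the table's current set of files, and stopping early yields an older version. The sketch below is a simplified illustration under that assumption: real commits carry more action types and metadata, and the `COMMITS` contents and the `snapshot` function are hypothetical.

```python
import json
from typing import List, Optional, Set

# Each string stands in for one commit file: one JSON action per line.
COMMITS = [
    '{"add": {"path": "part-0.parquet"}}\n{"add": {"path": "part-1.parquet"}}',
    '{"remove": {"path": "part-0.parquet"}}\n{"add": {"path": "part-2.parquet"}}',
]


def snapshot(commits: List[str], version: Optional[int] = None) -> Set[str]:
    """Replay commits 0..version and return the set of live data files."""
    last = len(commits) - 1 if version is None else version
    live: Set[str] = set()
    for commit in commits[: last + 1]:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


latest = snapshot(COMMITS)
as_of_v0 = snapshot(COMMITS, version=0)
```

Reading "as of" an earlier version is just a shorter replay, which is why the same log can support both the latest view and time travel.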

2020-03-19 - Analyzing COVID-19: Can the Data Community Help?

With the current concerns over SARS-CoV-2 and COVID-19, there are now various COVID-19 datasets on Kaggle and GitHub, competitions such as the COVID-19 Open Research Dataset Challenge (CORD-19), and models such as the University of Washington’s Institute for Health Metrics and Evaluation (IHME) COVID-19 Projections. Whether you are a student or a professional data scientist, we thought we could help out by providing educational sessions on how to analyze these datasets.

2020-03-19 - Machine Learning Lessons Learned from the Field: Interview with Brooke Wenig

Developer Advocate Denny Lee will interview Brooke Wenig, Machine Learning Practice Lead, on the best practices and patterns when developing, training, and deploying Machine Learning algorithms in production.

2020-03-12 - Simplify and Scale Data Engineering Pipelines with Delta Lake

A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion (“Bronze” tables), transformation/feature engineering (“Silver” tables), and machine learning training or prediction (“Gold” tables). Combined, we refer to these tables as a “multi-hop” architecture. It allows data engineers to build a pipeline that begins with raw data as a “single source of truth” from which everything flows. In this session, we will show how to build a scalable data engineering pipeline using Delta Lake.
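
The multi-hop idea can be sketched with in-memory lists standing in for tables. This is plain Python rather than Spark or Delta Lake code; the sample records and the `to_silver` and `to_gold` functions are hypothetical, and a simple per-state total stands in for the Gold step.

```python
from collections import defaultdict
from typing import Dict, List

# Bronze: raw records exactly as ingested, including bad rows.
bronze = [
    {"state": "WA", "cases": "10"},
    {"state": "NY", "cases": "25"},
    {"state": "WA", "cases": "n/a"},
    {"state": "NY", "cases": "5"},
]


def to_silver(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Silver: drop unparseable rows and cast fields to proper types."""
    cleaned = []
    for row in rows:
        if row["cases"].isdigit():
            cleaned.append({"state": row["state"], "cases": int(row["cases"])})
    return cleaned


def to_gold(rows: List[Dict[str, object]]) -> Dict[str, int]:
    """Gold: aggregate to the shape a report or model would consume."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["state"]] += row["cases"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Because `bronze` is never modified, the Silver and Gold results can always be recomputed from it, which is what treating raw data as the single source of truth buys you.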

2020-03-05 - Beyond Lambda: Introducing Delta Architecture

Lambda architecture is a popular technique in which records are processed by a batch system and a streaming system in parallel. With the advent of Delta Lake, we are seeing many of our customers adopt a simple continuous data flow model to process data as it arrives. We call this the Delta Architecture. In this session, we cover the major bottlenecks to adopting a continuous data flow model and how the Delta Architecture solves those problems.

2020-02-27 - Getting Data Ready for Data Science with Delta Lake and MLflow

One must take a holistic view of the entire data analytics realm when planning data science initiatives. Data engineering is a key enabler of data science, helping furnish reliable, quality data in a timely fashion. Delta Lake, an open-source storage layer that brings reliability to data lakes, can help take your data reliability to the next level.

2020-02-19 - The Genesis of Delta Lake - An Interview with Burak Yavuz

New decade, new start! Let's kick off 2020 with our first online meetup of the year featuring Burak Yavuz, Software Engineer at Databricks, for a talk about the genesis of Delta Lake. Developer Advocate Denny Lee will interview Burak Yavuz to learn about the Delta Lake team's decision-making process and why they designed, architected, and implemented Delta Lake the way it is today. Understand the technical challenges the team faced, how those challenges were solved, and learn about the plans for the future.

COVID-19-Samples

This section contains links to COVID-19 sample datasets and notebooks.

Datasets

| /databricks-datasets/[location] | Resource |
| --- | --- |
| /../COVID/CORD-19/ | COVID-19 Open Research Dataset Challenge (CORD-19) |
| /../COVID/CSSEGISandData/ | 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE |
| /../COVID/ESRI_hospital_beds/ | Definitive Healthcare: USA Hospital Beds |
| /../COVID/IHME/ | IHME (UW) COVID-19 Projections |
| /../COVID/USAFacts/ | USA Facts: Confirmed \| Deaths |
| /../COVID/coronavirusdataset/ | Data Science for COVID-19 (DS4C) (South Korea) |
| /../COVID/covid-19-data/ | NY Times COVID-19 Datasets |

Notebooks

| Notebooks | Description | Datasets Used |
| --- | --- | --- |
| Load JSON Datasets | Loading CORD-19 JSON Datasets | COVID-19 Open Research Dataset Challenge (CORD-19) |
| Analyzing CORD-19 Datasets | Exploratory Data Analysis of the CORD-19 dataset | COVID-19 Open Research Dataset Challenge (CORD-19) |
| NLP - Exploring CV19 Literature | Exploring CORD-19 Literature using NLP | COVID-19 Open Research Dataset Challenge (CORD-19) |
| South Korea COVID-19 Analysis | Exploratory Data Analysis of the South Korea COVID-19 dataset | Data Science for COVID-19 (DS4C) (South Korea) |
| Johns Hopkins COVID-19 Analysis | Exploratory Data Analysis of the Johns Hopkins CSSE COVID-19 dataset | 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE |
| NY Times COVID-19 Analysis | Exploratory Data Analysis of the NY Times COVID-19 dataset | NY Times COVID-19 Datasets |
