  • Language
    Scala
  • License
    MIT License

Monitoring Azure Databricks jobs

Monitoring Azure Databricks in an Azure Log Analytics Workspace

This branch of the library supports Azure Databricks Runtimes 10.x (Spark 3.2.x) and earlier (see Supported configurations).
Databricks has contributed an updated version to support Azure Databricks Runtimes 11.0 (Spark 3.3.x) and above on the l4jv2 branch at: https://github.com/mspnp/spark-monitoring/tree/l4jv2.
Be sure to use the correct branch and version for your Databricks Runtime.
⚠️ This library and GitHub repository are in maintenance mode. There are no plans for further releases, and issue support will be best-effort only. For any additional questions regarding this library or the roadmap for monitoring and logging of your Azure Databricks environments, please contact [email protected].

This repository extends the core monitoring functionality of Azure Databricks to send streaming query event information to Azure Monitor. For more information about using this library to monitor Azure Databricks, see Monitoring Azure Databricks.

The project has the following directory structure:

/src
    /spark-listeners-loganalytics
    /spark-listeners
    /pom.xml
/sample
    /spark-sample-job
/perftools
    /spark-sample-job

The spark-listeners-loganalytics and spark-listeners directories contain the code for building the two JAR files that are deployed to the Databricks cluster. The spark-listeners directory includes a scripts directory that contains a cluster node initialization script to copy the JAR files from a staging directory in the Azure Databricks file system to execution nodes. The pom.xml file is the main Maven project object model build file for the entire project.

The spark-sample-job directory is a sample Spark application demonstrating how to implement a Spark application metric counter.

The perftools directory contains details on how to use Azure Monitor with Grafana to monitor Spark performance.

Prerequisites

Before you begin, ensure you have the following prerequisites in place:

Supported configurations

Databricks Runtime(s)   Maven Profile
7.3 LTS                 scala-2.12_spark-3.0.1
9.1 LTS                 scala-2.12_spark-3.1.2
10.3 - 10.5             scala-2.12_spark-3.2.1
11.0                    See https://github.com/mspnp/spark-monitoring/tree/l4jv2
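
As a rough illustration of how the table above might be applied, the following Python sketch looks up the Maven profile for a runtime string. The function name `maven_profile_for` and the way the version string is parsed are assumptions made for this example; only the runtime-to-profile pairs come from the table.

```python
# Runtime ranges and profiles copied from the supported-configurations table.
# The function name and the parsing rules are illustrative assumptions.
SUPPORTED_PROFILES = [
    ((7, 3), (7, 3), "scala-2.12_spark-3.0.1"),
    ((9, 1), (9, 1), "scala-2.12_spark-3.1.2"),
    ((10, 3), (10, 5), "scala-2.12_spark-3.2.1"),
]


def maven_profile_for(runtime):
    """Return the Maven profile for a runtime string such as '10.4 LTS'."""
    version = runtime.split()[0]
    major, minor = (int(part) for part in version.split(".")[:2])
    if major >= 11:
        # Runtimes 11.0 and above are handled by the l4jv2 branch instead.
        raise ValueError("Runtime 11.0 and above is covered by the l4jv2 branch")
    for low, high, profile in SUPPORTED_PROFILES:
        if low <= (major, minor) <= high:
            return profile
    raise ValueError(f"No profile listed for runtime {runtime!r}")


print(maven_profile_for("10.4 LTS"))
```

Runtimes that the table does not list raise an error rather than falling back to a default, so a mismatched build is caught early.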

Logging Event Size Limit

This library currently limits each event to 25 MB, based on the Log Analytics limit of 30 MB per API call, with the remainder reserved for formatting overhead. By default, the library throws an exception when an event exceeds this limit. To change this behavior, set EXCEPTION_ON_FAILED_SEND to false in GenericSendBuffer.java.

Note: If your workload generates logging messages larger than 25 MB, the Spark logs will contain an error like java.lang.RuntimeException: Failed to schedule batch because first message size nnn exceeds batch size limit 26214400 (bytes)., and your workload may not proceed. You can query Log Analytics for this error condition with:

SparkLoggingEvent_CL
| where TimeGenerated > ago(24h)
| where Message contains "java.lang.RuntimeException: Failed to schedule batch because first message size"
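
To make the limit concrete, here is a minimal Python sketch of a pre-check that a workload could run before logging a very large message. It is not the library's implementation (that lives in GenericSendBuffer.java). The function name, the UTF-8 sizing, and the strict "greater than" comparison are assumptions; the only value taken from the text is the 26214400-byte limit.

```python
# 25 MB, which equals the 26214400-byte limit quoted in the error message.
MAX_EVENT_BYTES = 25 * 1024 * 1024


def exceeds_event_limit(message, limit=MAX_EVENT_BYTES):
    """Return True when the UTF-8 encoded message is larger than the limit.

    Assumption: size is measured on the UTF-8 bytes of the message alone,
    without the formatting overhead the library adds before sending.
    """
    return len(message.encode("utf-8")) > limit


print(MAX_EVENT_BYTES)
print(exceeds_event_limit("a short log line"))
```

A check like this lets a job truncate or split an oversized message instead of triggering the exception.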

Build the Azure Databricks monitoring library

You can build the library using either Docker or Maven. All commands are intended to be run from the base directory of the repository.

The jar files that will be produced are:

spark-listeners_<Spark Version>_<Scala Version>-<Version>.jar - This is the generic implementation of the Spark Listener framework that provides capability for collecting data from the running cluster for forwarding to another logging system.

spark-listeners-loganalytics_<Spark Version>_<Scala Version>-<Version>.jar - This is the specific implementation that extends spark-listeners. This project provides the implementation for connecting to Log Analytics and formatting and passing data via the Log Analytics API.
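
The naming pattern above can be sketched in Python as follows. The helper `expected_jar_names` is hypothetical, and the version string passed in the example ("1.0.0") is a placeholder rather than a value stated in this document. The sketch assumes the profile name always has the form scala-<Scala Version>_spark-<Spark Version>, as in the supported configurations.

```python
def expected_jar_names(profile, version):
    """Build the two JAR file names for a profile like 'scala-2.12_spark-3.2.1'."""
    scala_part, spark_part = profile.split("_")
    if not (scala_part.startswith("scala-") and spark_part.startswith("spark-")):
        raise ValueError(f"Unexpected profile format: {profile!r}")
    scala_version = scala_part[len("scala-"):]
    spark_version = spark_part[len("spark-"):]
    # Pattern from the text: <project>_<Spark Version>_<Scala Version>-<Version>.jar
    suffix = f"_{spark_version}_{scala_version}-{version}.jar"
    projects = ("spark-listeners", "spark-listeners-loganalytics")
    return [project + suffix for project in projects]


# "1.0.0" is a placeholder version used only for illustration.
for name in expected_jar_names("scala-2.12_spark-3.2.1", "1.0.0"):
    print(name)
```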

Option 1: Docker

Linux:

# To build all profiles:
docker run -it --rm -v `pwd`:/spark-monitoring -v "$HOME/.m2":/root/.m2 mcr.microsoft.com/java/maven:8-zulu-debian10 /spark-monitoring/build.sh
# To build a single profile (example for latest long term support version 10.4 LTS):
docker run -it --rm -v `pwd`:/spark-monitoring -v "$HOME/.m2":/root/.m2 -w /spark-monitoring/src mcr.microsoft.com/java/maven:8-zulu-debian10 mvn install -P "scala-2.12_spark-3.2.1"

Windows:

# To build all profiles:
docker run -it --rm -v %cd%:/spark-monitoring -v "%USERPROFILE%/.m2":/root/.m2 mcr.microsoft.com/java/maven:8-zulu-debian10 /spark-monitoring/build.sh
# To build a single profile (example for latest long term support version 10.4 LTS):
docker run -it --rm -v %cd%:/spark-monitoring -v "%USERPROFILE%/.m2":/root/.m2 -w /spark-monitoring/src mcr.microsoft.com/java/maven:8-zulu-debian10 mvn install -P "scala-2.12_spark-3.2.1"

Option 2: Maven

  1. Import the Maven project object model file, pom.xml, located in the /src folder into your project. This will import two projects:

    • spark-listeners
    • spark-listeners-loganalytics
  2. Activate a single Maven profile that corresponds to the versions of the Scala/Spark combination that is being used. By default, the Scala 2.12 and Spark 3.0.1 profile is active.

  3. Execute the Maven package phase in your Java IDE to build the JAR files for each of these projects:

    Project                        JAR file
    spark-listeners                spark-listeners_<Spark Version>_<Scala Version>-<Version>.jar
    spark-listeners-loganalytics   spark-listeners-loganalytics_<Spark Version>_<Scala Version>-<Version>.jar

Configure the Databricks workspace

Copy the JAR files and init scripts to Databricks.

  1. Use the Azure Databricks CLI to create a directory named dbfs:/databricks/spark-monitoring:

    dbfs mkdirs dbfs:/databricks/spark-monitoring
  2. Open the /src/spark-listeners/scripts/spark-monitoring.sh script file and add your Log Analytics Workspace ID and Key to the lines below:

    export LOG_ANALYTICS_WORKSPACE_ID=
    export LOG_ANALYTICS_WORKSPACE_KEY=

If you do not want to add your Log Analytics workspace ID and key to the init script in plaintext, you can instead create an Azure Key Vault-backed secret scope and reference those secrets through your cluster's environment variables.

  3. To add the x-ms-AzureResourceId header to the HTTP request, set the following environment variables in /src/spark-listeners/scripts/spark-monitoring.sh. For instance:

    export AZ_SUBSCRIPTION_ID=11111111-5c17-4032-ae54-fc33d56047c2
    export AZ_RSRC_GRP_NAME=myAzResourceGroup
    export AZ_RSRC_PROV_NAMESPACE=Microsoft.Databricks
    export AZ_RSRC_TYPE=workspaces
    export AZ_RSRC_NAME=myDatabricks

With these values, the _ResourceId /subscriptions/11111111-5c17-4032-ae54-fc33d56047c2/resourceGroups/myAzResourceGroup/providers/Microsoft.Databricks/workspaces/myDatabricks will be part of the header. (Note: If any of these variables is not set, the header won't be included.)

  4. Use the Azure Databricks CLI to copy src/spark-listeners/scripts/spark-monitoring.sh to the directory created in step 1:

    dbfs cp src/spark-listeners/scripts/spark-monitoring.sh dbfs:/databricks/spark-monitoring/spark-monitoring.sh
  5. Use the Azure Databricks CLI to copy all of the JAR files from the src/target folder to the directory created in step 1:

    dbfs cp --overwrite --recursive src/target/ dbfs:/databricks/spark-monitoring/
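
The x-ms-AzureResourceId step above describes a resource ID assembled from five environment variables, with the header omitted when any of them is missing. The Python sketch below reproduces that rule for illustration only. It is not the library's code, and the function name `azure_resource_id` is an assumption; the variable names and the ID layout are taken from the example in the text.

```python
import os

# Environment variable names from the spark-monitoring.sh example above.
REQUIRED_VARS = (
    "AZ_SUBSCRIPTION_ID",
    "AZ_RSRC_GRP_NAME",
    "AZ_RSRC_PROV_NAMESPACE",
    "AZ_RSRC_TYPE",
    "AZ_RSRC_NAME",
)


def azure_resource_id(env=None):
    """Return the resource ID string, or None if any variable is unset or empty."""
    env = os.environ if env is None else env
    values = [env.get(name, "") for name in REQUIRED_VARS]
    if not all(values):
        return None  # Mirrors the note: the header is not included.
    subscription, group, namespace, rtype, name = values
    return (
        f"/subscriptions/{subscription}/resourceGroups/{group}"
        f"/providers/{namespace}/{rtype}/{name}"
    )


example = {
    "AZ_SUBSCRIPTION_ID": "11111111-5c17-4032-ae54-fc33d56047c2",
    "AZ_RSRC_GRP_NAME": "myAzResourceGroup",
    "AZ_RSRC_PROV_NAMESPACE": "Microsoft.Databricks",
    "AZ_RSRC_TYPE": "workspaces",
    "AZ_RSRC_NAME": "myDatabricks",
}
print(azure_resource_id(example))
```

Running a check like this before uploading the init script is a quick way to confirm that all five values are set.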

Create and configure the Azure Databricks cluster

  1. Navigate to your Azure Databricks workspace in the Azure Portal.
  2. Under "Compute", click "Create Cluster".
  3. Choose a name for your cluster and enter it in "Cluster name" text box.
  4. In the "Databricks Runtime Version" dropdown, select Runtime: 10.4 LTS (Scala 2.12, Spark 3.2.1).
  5. Under "Advanced Options", click the "Init Scripts" tab. Go to the last line under the "Init Scripts" section. Under the "destination" dropdown, select "DBFS". Enter "dbfs:/databricks/spark-monitoring/spark-monitoring.sh" in the text box. Click the "add" button.
  6. Click the "Create Cluster" button to create the cluster. Next, click on the "start" button to start the cluster.
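
For teams that script cluster creation instead of using the portal, the settings in the steps above could be captured as a cluster specification. The Python sketch below only builds the JSON payload and does not call any service. The field names follow the Databricks Clusters API as I understand it and should be checked against the current API reference. The node type, worker count, secret scope, and secret key names are hypothetical values chosen for illustration. The optional `spark_env_vars` entries show one way to supply the workspace ID and key from a secret scope rather than storing them in the init script.

```python
import json

INIT_SCRIPT = "dbfs:/databricks/spark-monitoring/spark-monitoring.sh"


def cluster_spec(cluster_name, node_type_id, num_workers, secret_scope=None):
    """Build a cluster definition that attaches the monitoring init script."""
    spec = {
        "cluster_name": cluster_name,
        # Runtime 10.4 LTS (Scala 2.12, Spark 3.2.1), as selected in step 4.
        "spark_version": "10.4.x-scala2.12",
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        "init_scripts": [{"dbfs": {"destination": INIT_SCRIPT}}],
    }
    if secret_scope:
        # Hypothetical secret names; use whatever keys exist in your scope.
        spec["spark_env_vars"] = {
            "LOG_ANALYTICS_WORKSPACE_ID":
                "{{secrets/%s/log-analytics-workspace-id}}" % secret_scope,
            "LOG_ANALYTICS_WORKSPACE_KEY":
                "{{secrets/%s/log-analytics-workspace-key}}" % secret_scope,
        }
    return spec


# Node type and scope name are placeholders, not values from this repository.
spec = cluster_spec("monitored-cluster", "Standard_DS3_v2", 2, "my-scope")
print(json.dumps(spec, indent=2))
```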

Run the sample job (optional)

The repository includes a sample application that shows how to send application metrics and application logs to Azure Monitor.

When building the sample job, specify a Maven profile compatible with your Databricks runtime from the Supported configurations section.

  1. Use Maven to build the POM located at sample/spark-sample-job/pom.xml or run the following Docker command:

    Linux:

    docker run -it --rm -v `pwd`/sample/spark-sample-job:/spark-sample-job -v "$HOME/.m2":/root/.m2 -w /spark-sample-job mcr.microsoft.com/java/maven:8-zulu-debian10 mvn install -P <maven-profile>

    Windows:

    docker run -it --rm -v %cd%/sample/spark-sample-job:/spark-sample-job -v "%USERPROFILE%/.m2":/root/.m2 -w /spark-sample-job mcr.microsoft.com/java/maven:8-zulu-debian10 mvn install -P <maven-profile>
  2. Navigate to your Databricks workspace and create a new job, as described here.

  3. In the job detail page, set Type to JAR.

  4. For Main class, enter com.microsoft.pnp.samplejob.StreamingQueryListenerSampleJob.

  5. Upload the JAR file from /src/spark-jobs/target/spark-jobs-1.0-SNAPSHOT.jar in the Dependent Libraries section.

  6. Select the cluster you created previously in the Cluster section.

  7. Select Create.

  8. Click the Run Now button to launch the job.

When the job runs, you can view the application logs and metrics in your Log Analytics workspace. After you verify the metrics appear, stop the sample application job.

Viewing the Sample Job's Logs in Log Analytics

After your sample job has run for a few minutes, you should be able to query for these event types in Log Analytics:

SparkListenerEvent_CL

This custom log will contain Spark events that are serialized to JSON. You can limit the volume of events in this log with filtering. If filtering is not employed, this can be a large volume of data.

Note: There is a known issue when the Spark framework or workload generates events that have more than 500 fields, or where the data for an individual field is larger than 32 KB. In these cases, Log Analytics generates an error indicating that data has been dropped. This is an incompatibility between the data generated by Spark and the current limitations of the Log Analytics API.
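
A workload that builds its own large events could check them against these two limits before they are emitted. The Python sketch below is one possible check, not part of the library. It assumes that nested JSON objects are flattened into underscore-joined field names (similar to column names such as Job_Result_Result_s) and that the per-field limit means 32 * 1024 bytes of UTF-8 text. Both are assumptions to verify against the Log Analytics documentation.

```python
import json

MAX_FIELDS = 500
MAX_FIELD_BYTES = 32 * 1024  # Assumption: the per-field limit is 32 * 1024 bytes.


def flatten(event, prefix=""):
    """Flatten nested dicts into a single level with underscore-joined keys."""
    flat = {}
    for key, value in event.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def limit_violations(event):
    """Return a list of human-readable problems; empty if the event fits."""
    flat = flatten(event)
    problems = []
    if len(flat) > MAX_FIELDS:
        problems.append(f"{len(flat)} fields exceeds {MAX_FIELDS}")
    for name, value in flat.items():
        text = value if isinstance(value, str) else json.dumps(value)
        size = len(text.encode("utf-8"))
        if size > MAX_FIELD_BYTES:
            problems.append(f"field {name} is {size} bytes")
    return problems


sample = {"Event": "SparkListenerJobStart", "Properties": {"plan": "x" * 40000}}
print(limit_violations(sample))
```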

Example for querying SparkListenerEvent_CL for job throughput over the last 7 days:

let results=SparkListenerEvent_CL
| where TimeGenerated > ago(7d)
| where  Event_s == "SparkListenerJobStart"
| extend metricsns=column_ifexists("Properties_spark_metrics_namespace_s",Properties_spark_app_id_s)
| extend apptag=iif(isnotempty(metricsns),metricsns,Properties_spark_app_id_s)
| project Job_ID_d,apptag,Properties_spark_databricks_clusterUsageTags_clusterName_s,TimeGenerated
| order by TimeGenerated asc nulls last
| join kind= inner (
    SparkListenerEvent_CL
    | where Event_s == "SparkListenerJobEnd"
    | where Job_Result_Result_s == "JobSucceeded"
    | project Event_s,Job_ID_d,TimeGenerated
) on Job_ID_d;
results
| extend slice=strcat("#JobsCompleted ",Properties_spark_databricks_clusterUsageTags_clusterName_s,"-",apptag)
| summarize count() by bin(TimeGenerated, 1h),slice
| order by TimeGenerated asc nulls last

SparkLoggingEvent_CL

This custom log will contain data forwarded from Log4j (the standard logging system in Spark). The volume of logging can be controlled by altering the level of logging to forward or with filtering.

Example for querying SparkLoggingEvent_CL for logged errors over the last day:

SparkLoggingEvent_CL
| where TimeGenerated > ago(1d)
| where Level == "ERROR"

SparkMetric_CL

This custom log will contain metrics events as generated by the Spark framework or workload. You can adjust the time period or sources included by modifying the METRICS_PROPERTIES section of the spark-monitoring.sh script or by enabling filtering.

Example of querying SparkMetric_CL for the number of active executors per application over the last 7 days summarized every 15 minutes:

SparkMetric_CL
| where TimeGenerated > ago(7d)
| extend sname=split(name_s, ".")
| where sname[2] == "executor"
| extend executor=strcat(sname[1]) 
| extend app=strcat(sname[0])
| summarize NumExecutors=dcount(executor) by bin(TimeGenerated,  15m),app
| order by TimeGenerated asc nulls last
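
To show what the query above computes, the following Python sketch applies the same logic to a few in-memory rows: split the metric name on ".", keep rows whose third segment is "executor", and count distinct executors per application in 15-minute bins. The sample metric names and timestamps are invented for illustration. Note that the KQL `dcount` function returns an estimate, whereas this sketch counts exactly.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bin_15m(ts):
    """Round a timestamp down to the start of its 15-minute bin."""
    return ts - timedelta(
        minutes=ts.minute % 15, seconds=ts.second, microseconds=ts.microsecond
    )


def active_executors(rows):
    """Count distinct executors per (15-minute bin, application)."""
    seen = defaultdict(set)
    for ts, name in rows:
        parts = name.split(".")
        # Same filter as the query: the third segment must be "executor".
        if len(parts) > 2 and parts[2] == "executor":
            app, executor = parts[0], parts[1]
            seen[(bin_15m(ts), app)].add(executor)
    return {key: len(executors) for key, executors in seen.items()}


# Invented sample rows: (TimeGenerated, name_s).
rows = [
    (datetime(2024, 1, 1, 10, 1), "app-1.0.executor.threadpool.activeTasks"),
    (datetime(2024, 1, 1, 10, 5), "app-1.1.executor.threadpool.activeTasks"),
    (datetime(2024, 1, 1, 10, 7), "app-1.1.executor.filesystem.hdfs.read_bytes"),
    (datetime(2024, 1, 1, 10, 20), "app-1.0.executor.threadpool.activeTasks"),
    (datetime(2024, 1, 1, 10, 2), "app-1.driver.BlockManager.memory.memUsed_MB"),
]
for key, count in sorted(active_executors(rows).items()):
    print(key, count)
```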

Note: For more details on how to use the saved search queries in logAnalyticsDeploy.json to understand and troubleshoot performance, see Observability patterns and metrics for performance tuning.

Filtering

The library is configurable to limit the volume of logs that are sent to each of the different Azure Monitor log types. See filtering for more details.

Debugging

If you encounter any issues with the init script, you can refer to the docs on debugging.

Contributing

See: CONTRIBUTING.md
