

Repository Details

Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.

Dynolog: a performance monitoring daemon for heterogeneous CPU-GPU systems


Introduction

Dynolog is a lightweight monitoring daemon for heterogeneous CPU-GPU systems. It supports both always-on performance monitoring, as well as deep-dive profiling modes. The latter can be activated by making a remote procedure call to the daemon.

Below are some of the key features, which we will explore in more detail later in this Readme.

  • Dynolog integrates with the PyTorch Profiler and provides on-demand remote tracing features. One can use a single command line tool (dyno CLI) to simultaneously trace hundreds of GPUs and examine the collected traces (available from PyTorch v1.13.0 onwards).
  • It incorporates GPU performance monitoring for NVIDIA GPUs using DCGM.
  • Dynolog manages counters for micro-architecture specific performance events related to CPU Cache, TLBs etc on Intel and AMD CPUs. Additionally, it instruments telemetry from the Linux kernel including CPU, network and IO resource usage.
  • We are actively implementing new features, including support for Intel Processor Trace as well as memory latency and bandwidth monitoring.

We focus on Linux platforms, as they are used heavily in cloud environments.

Motivation

Large scale AI models use distributed AI training across multiple compute nodes. They also leverage hardware accelerators like GPUs to boost performance. One has to carefully optimize their AI applications to make the most of the underlying hardware while avoiding performance bottlenecks. This is where great performance monitoring and profiling tools become indispensable.

While there are existing solutions for monitoring (1, 2) and for profiling CPUs (Intel’s VTune) and GPUs (NSight), it is challenging to assemble them into a holistic view of the system. For example, we need to understand whether an inefficiency on one resource, such as the communication fabric, is slowing down the overall computation. Additionally, these solutions need to work in a production environment without causing performance degradation.

Dynolog leverages the underlying monitoring and profiling interfaces of the Linux kernel, CPU Performance Monitoring Units (PMUs), and GPUs. It also interacts with the PyTorch Profiler within the application to support on-demand profiling. In this way, it helps identify bottlenecks at various points in the system.

Supported Metrics

Dynolog’s always-on (continuous) monitoring supports the following classes of metrics:

  1. System/kernel metrics.
  2. CPU Performance Monitoring Unit (PMU) metrics using Linux perf_event.
  3. NVIDIA GPU metrics from DCGM, if enabled.

A detailed list of covered metrics is provided in docs/Metrics.md.

Getting Started

Installation

Dynolog can be installed using the package manager of your choice, with either an RPM for CentOS or a Debian package for Ubuntu-like distros. We do not support non-Linux platforms. There are no required dependencies except DCGM, which is needed only if you want to monitor GPUs; please see the section on GPU monitoring below.

Obtain the latest dynolog release or pick up one of the releases here

# for CentOS
wget https://github.com/facebookincubator/dynolog/releases/download/v0.2.1/dynolog-0.2.1-1.el8.x86_64.rpm
sudo rpm -i dynolog-0.2.1-1.el8.x86_64.rpm

# for Ubuntu or similar Debian based linux distros
wget https://github.com/facebookincubator/dynolog/releases/download/v0.2.1/dynolog_0.2.1-0-amd64.deb
sudo dpkg -i dynolog_0.2.1-0-amd64.deb

No sudo access?

Dynolog can run in userspace mode with most features functional. There are a few options for running dynolog in userspace. One way is to simply decompress the RPM or Debian package as shown below.

mkdir -p dynolog_pkg; cd dynolog_pkg
wget https://github.com/facebookincubator/dynolog/releases/download/v0.2.1/dynolog_0.2.1-0-amd64.deb
ar x dynolog_0.2.1-0-amd64.deb; tar xvf data.tar.xz
# binaries should now be available in ./usr/local/bin, you can add this directory to your $PATH.

Alternatively, you can build dynolog from source. The binaries should be present in the build/bin directory. The packages provide systemd support to run the server as a daemon. You can, however, still run the dynolog server directly in a separate terminal.

Running dynolog

Start the Dynolog service using systemd -

sudo systemctl start dynolog

Note:

  • The dynolog service picks up runtime flags from /etc/dynolog.gflags if the file is present.
  • Output logs are written to /var/log/dynolog.log and are rotated automatically.

One can check the values of the metrics emitted in the output log file.

$> tail /var/log/dynolog.log
I20220721 23:42:34.141083 3632432 Logger.cpp:37] Logging : 12 values
I20220721 23:42:34.141104 3632432 Logger.cpp:38] time = 2022-07-21T23:42:34.141Z data = {"cpu_i":"71.342"

The dyno command line tool communicates with the dynolog daemon on the local or remote system. For example, we can verify if the daemon is running using the status subcommand.

$> dyno status
response length = 12
response = {"status":1}

$> dyno --hostname some_remote_host.com status
response length = 12
response = {"status":1}

Run dyno --help for help on other subcommands.
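
For scripted health checks, the response line printed by dyno status can be parsed as JSON. The following is a minimal sketch that works on the text shown in the status example; the helper name parse_dyno_response is hypothetical, and treating a status of 1 as "running" is an assumption based on the sample output rather than a documented contract.

```python
import json

PREFIX = "response = "

def parse_dyno_response(output: str) -> dict:
    """Extract the JSON payload from the 'response = ...' line of dyno output."""
    for line in output.splitlines():
        if line.startswith(PREFIX):
            return json.loads(line[len(PREFIX):])
    raise ValueError("no 'response = ' line found in dyno output")

# Sample text copied from the status example in this Readme.
sample = 'response length = 12\nresponse = {"status":1}\n'
reply = parse_dyno_response(sample)

# Assumption: status == 1 means the daemon is up.
print("daemon running:", reply.get("status") == 1)
```

In practice the output string would come from running the dyno CLI, for example through subprocess.run with capture_output=True.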

Server Command Line options

Lastly, the dynolog server provides various flags; we list the key ones here. Run dynolog --help for more info.

  • --port (default=1778) - the port used to set up a service for remote queries.
  • --reporting_interval_s (default=60) - the reporting interval for metrics. Please see the Logging section for more details.
  • --enable_ipc_monitor sets up inter-process communication endpoint. This can be used to talk to applications like pytorch trainers.
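
When several of these flags need to be set together, it can be convenient to generate the contents of the flags file (such as /etc/dynolog.gflags) from a small script. The sketch below is illustrative only: render_gflags is a hypothetical helper, and the one-flag-per-line --name=value layout is an assumption based on the usual gflags flag-file convention and on the single-line entries shown elsewhere in this Readme.

```python
def render_gflags(flags: dict) -> str:
    """Render flag names and values as flag-file text, one flag per line."""
    lines = []
    for name, value in flags.items():
        if value is True:
            lines.append(f"--{name}")          # bare boolean flag
        elif value is False:
            lines.append(f"--{name}=false")    # explicit boolean off
        else:
            lines.append(f"--{name}={value}")  # valued flag
    return "\n".join(lines) + "\n"

# Flag names and defaults are taken from the list above.
config = render_gflags({
    "port": 1778,
    "reporting_interval_s": 60,
    "enable_ipc_monitor": True,
})
print(config, end="")
```

The resulting text could then be written to the flags file by whatever deployment mechanism you use; writing to /etc typically requires root access.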

Collecting pytorch/GPU traces

To enable pytorch profiling add the flag --enable_ipc_monitor. This enables the server to communicate with the pytorch process. If you are running the server with systemd, do the following-

echo "--enable_ipc_monitor" | sudo tee -a /etc/dynolog.gflags
sudo systemctl restart dynolog

You also need a compatible version of PyTorch (v1.13.0 or later) and must set the environment variable KINETO_USE_DAEMON=1 before running the PyTorch program. See docs/pytorch_profiler.md for details.

Traces can now be captured using the gputrace subcommand

dyno gputrace --pids <pid of process> --log_file <output file path>

Dynolog can also 1) capture traces on remote nodes, 2) co-ordinate tracing across a distributed training job (with slurm job scheduler). Please see the recipe in docs/pytorch_profiler.md for a detailed walkthrough of this feature.
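
As a rough illustration of fanning a trace request out to several nodes, one could build one dyno invocation per host by combining the --hostname option with the gputrace subcommand shown above. This is only a sketch: gputrace_command is a hypothetical helper, and the host names, PID, and output path are placeholders. The recipe in docs/pytorch_profiler.md remains the reference for distributed jobs.

```python
import shlex
from typing import List, Optional

def gputrace_command(pid: int, log_file: str,
                     hostname: Optional[str] = None) -> List[str]:
    """Build the argument list for a single 'dyno gputrace' invocation."""
    cmd = ["dyno"]
    if hostname is not None:
        # --hostname goes before the subcommand, as in the status example.
        cmd += ["--hostname", hostname]
    cmd += ["gputrace", "--pids", str(pid), "--log_file", log_file]
    return cmd

# Placeholder hosts and PID, for illustration only.
hosts = ["node1.example.com", "node2.example.com"]
for host in hosts:
    print(shlex.join(gputrace_command(1234, "/tmp/trace.json", hostname=host)))
```

Each printed line can be run as is, or the argument list can be passed directly to subprocess.run.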

GPU Monitoring

Dynolog currently relies on the NVIDIA Data Center GPU Manager (DCGM) to monitor NVIDIA GPUs. DCGM supports GPU models from Kepler/Volta V100 onwards; please see this page for details. Dynolog supports both DCGM 2.x and DCGM 3.x, selecting between them at runtime based on the version of the shared library libdcgm.so installed on the system.

Prerequisite: Install DCGM on your system following the instructions here. After the installation please make sure to initialize the dcgm service using -

sudo systemctl --now enable nvidia-dcgm

To run dynolog with GPU monitoring you can add/tweak these flags

  • --enable_gpu_monitor (default=false): enable GPU monitoring using DCGM.
  • --dcgm_lib_path (default=/lib64/libdcgm.so): the path to DCGM shared library libdcgm.so. If the default path does not exist please adjust it.
  • --dcgm_fields (default="100,155,204,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012"): comma-separated string of DCGM field IDs to monitor; please see the NVIDIA headers for field definitions.
  • --dcgm_report_interval_s (default=10): the interval between DCGM counter updates, in seconds. Please note that when this value is too small, DCGM may start multiplexing the counters, which can affect counter accuracy.

echo "--enable_gpu_monitor" | sudo tee -a /etc/dynolog.gflags
sudo systemctl restart dynolog
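
If libdcgm.so is not at the default path, the --dcgm_lib_path flag has to point at the real location. The sketch below shows one way to look for the library and emit the matching flags. The helper name dcgm_flags is hypothetical, and the candidate directories are guesses that vary by distribution, so adjust them for your system.

```python
from pathlib import Path
from typing import Iterable, List

DEFAULT_DCGM_LIB = "/lib64/libdcgm.so"  # default value of --dcgm_lib_path

def dcgm_flags(candidate_dirs: Iterable[str],
               default: str = DEFAULT_DCGM_LIB) -> List[str]:
    """Return GPU monitoring flags, adding --dcgm_lib_path only if needed."""
    flags = ["--enable_gpu_monitor"]
    if Path(default).exists():
        return flags  # the default path works, no override required
    for directory in candidate_dirs:
        if not Path(directory).is_dir():
            continue
        matches = sorted(Path(directory).glob("libdcgm.so*"))
        if matches:
            flags.append(f"--dcgm_lib_path={matches[0]}")
            return flags
    raise FileNotFoundError("libdcgm.so not found; is DCGM installed?")

# Candidate directories are assumptions, not taken from the dynolog docs.
try:
    print("\n".join(dcgm_flags(["/usr/lib64", "/usr/lib/x86_64-linux-gnu"])))
except FileNotFoundError as err:
    print(err)
```

The printed flags can be appended to /etc/dynolog.gflags in the same way as the command above.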

CPU Performance Events

Dynolog also supports collection of CPU hardware performance events. We added CPU instructions and cycles as the first set of counters; they are reported as mips (millions of instructions per second) and mega_cycles_per_second. See docs/Metrics.md for more details.

The following flags are relevant to hardware PMU performance monitoring:

  • --enable_perf_monitor (default=false): enable hardware PMU performance monitoring.
  • --perf_monitor_reporting_interval_s (default=60): set the reporting interval of performance metrics in seconds.

Sample logs emitted to the log file:

I20221208 15:28:34.730270 3345417 Logger.cpp:55] Logging : 2 values
I20221208 15:28:34.730316 3345417 Logger.cpp:56] time = 1969-12-31T16:00:00.000Z data = {"mega_cycles_per_second":"735.696","mips":"691.497"}
I20221208 15:29:34.731652 3345417 Logger.cpp:55] Logging : 2 values
I20221208 15:29:34.731690 3345417 Logger.cpp:56] time = 1969-12-31T16:00:00.000Z data = {"mega_cycles_per_second":"514.156","mips":"479.397"}
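
Log lines in this format can be post-processed with a few lines of Python. The sketch below extracts the timestamp and the JSON payload from a line, then derives instructions per cycle (IPC) as mips divided by mega_cycles_per_second. The helper name parse_metrics is hypothetical, and the code assumes every value in the payload is a numeric string, as in the sample above.

```python
import json
import re

# Matches the 'time = ... data = {...}' portion of a dynolog log line.
LOG_PATTERN = re.compile(r"time = (?P<time>\S+) data = (?P<data>\{.*\})")

def parse_metrics(line: str):
    """Return (timestamp, metrics) for a data line, or None for other lines."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    raw = json.loads(match.group("data"))
    # Assumption: all metric values are numeric strings.
    return match.group("time"), {key: float(value) for key, value in raw.items()}

# First data line from the sample log above.
line = ('I20221208 15:28:34.730316 3345417 Logger.cpp:56] '
        'time = 1969-12-31T16:00:00.000Z '
        'data = {"mega_cycles_per_second":"735.696","mips":"691.497"}')
timestamp, metrics = parse_metrics(line)

# Both rates are in millions per second, so their ratio is instructions per cycle.
ipc = metrics["mips"] / metrics["mega_cycles_per_second"]
print(f"{timestamp} IPC = {ipc:.2f}")
```

Lines such as "Logging : 2 values" do not match the pattern and are skipped by returning None.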

Building from source

Dynolog consists of two binaries - 1) dynolog server and 2) dyno command line tool. To build from source please check out the project and initialize submodules.

git clone https://github.com/facebookincubator/dynolog.git
# enter the checkout before initializing submodules
cd dynolog
git submodule update --init --recursive

Requirements

This project mainly depends on C++17 and Rust compiler toolchains. Other dependencies include cmake and the ninja build system. We have tested that dynolog works with the following versions of the C++ and Rust toolchains:

Language   Toolchain
C++        gcc 8.5.0+
Rust       Rust 1.58.1 (1.56+ required for clap dependency)

Based on your system you could use one of the following instructions:

RHEL/CentOS

Install cmake, ninja, and cargo (for Rust).

sudo dnf install cmake ninja cargo

Alternatively, you can install Rust using the script provided here.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Ubuntu/Debian

Install cmake and ninja

sudo apt-get install -y cmake ninja-build

Install Rust using the script provided here.

Conda

In case of no root access you can install these dependencies in a conda environment as well. A few features may not work without root access (we will document more details soon).

conda install cmake ninja
conda install -c conda-forge "rust=1.64.0"

GPU Monitoring

For NVIDIA GPUs (currently the only ones supported), you can install DCGM as described in the GPU Monitoring section.

Building

Please use the build script; it should generate files in ./build relative to your project root. The script will print the paths to the server and client binaries.

./scripts/build.sh

Note that the Rust build system needs an internet connection the first time it runs.

Building packages

The preferred method to run dynolog is by deploying a package, either RPM or Debian. Please see scripts/README.md for instructions on how to build dynolog packages.

Logging

By default dynolog will save monitoring metrics to the standard output -

I20220721 23:42:34.141104 3632432 Logger.cpp:38] time = 2022-07-21T23:42:34.141Z data = {"cpu_i":"71.342" ...

Dynolog includes an abstract Logger class that can be specialized for different logging destinations. Currently, Dynolog supports logging to the Meta ODS datastore and the Meta Scuba data system; instructions can be found in docs/logging_to_ods.md and docs/logging_to_scuba.md. The Dynolog team is happy to support new loggers.

Releases

Please see our releases page for the latest releases and the features and fixes included in them.

In upcoming near-term releases we plan to add -

  • Disk usage metrics like storage size and IO operations per second.
  • Memory latency and bandwidth monitoring per socket for Intel and AMD CPUs.

At some future point we would also like to add -

  • Capability to collect CPU traces using Intel Processor Trace.
  • OpenTelemetry support for logging.

Contact Us

Dynolog is actively maintained by Meta engineers: Brian Coutinho, Zachary Jones, Hao Wang, William Sumendap, Jakob Johnson, Alston Tang, Walter Erquinigo, and David Carrillo Cisneros. We would also like to thank various internal contributors to the dynolog project: Sam Crossley, Jayesh Lahori, Matt Skach, Darshan Sanghani, Lucas Molander, Parth Malani, and Jason Taylor. This is by no means an exhaustive list. Special thanks to Herman Chin, Victor Henriquez, Anupam Bhatnagar, Caleb Ho, Aaron Shi, Jay Chae, Song Liu, Susan Zhang, Geeta Chauhan, and Gisle Dankel for supporting this initiative.

Join the dynolog community

Please file bugs or feature requests as issues on GitHub. See the CONTRIBUTING file for how to help out.

License

Dynolog is licensed under the MIT License.
