  • Language
    Elixir
  • License
    MIT License
  • Created over 5 years ago
  • Updated 10 months ago

Efficient Process.monitor replacement

ZenMonitor

ZenMonitor allows for the efficient monitoring of remote processes with minimal use of ERTS Distribution.

Installation

Add ZenMonitor to your dependencies

def deps do
  [
    {:zen_monitor, "~> 2.1.0"}
  ]
end

Using ZenMonitor

ZenMonitor strives to be a drop-in replacement for Process.monitor/1. To that end, the complexities of how it carries out its task are hidden behind a simple, unified programming interface. All the functions that the caller needs have convenient delegates in the top-level ZenMonitor module. The interface is detailed below.

ZenMonitor.monitor/1

This is a drop-in replacement for Process.monitor/1 when it comes to processes. It is compatible with the various ways that Process.monitor/1 can establish monitors and will accept one of a pid, a name which is the atom that a local process is registered under, or a tuple of {name, node} for a registered process on a remote node. These are defined as the ZenMonitor.destination type.

ZenMonitor.monitor/1 returns a standard reference that can be used to demonitor and can be matched against the reference provided in the :DOWN message.

Similar to Process.monitor/1, the caller is allowed to monitor the same process multiple times. Each monitor will be provided with a unique reference, and all monitors will fire :DOWN messages when the monitored process goes down. ZenMonitor is designed to handle multiple monitors efficiently: the only cost is an additional ETS row on the local node and additional processing time at fan-out.
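As a rough illustration of the destination forms and the one-reference-per-monitor behavior described above, here is a minimal sketch. MonitorSketch, monitor_all/2, and the monitor_mod parameter are hypothetical names, not part of ZenMonitor. The parameter defaults to ZenMonitor (which assumes the :zen_monitor application is running) and exists only so the sketch can also be tried with Process, which exposes the same monitor/1 shape.

```elixir
defmodule MonitorSketch do
  # Hypothetical helper, not part of ZenMonitor.
  # `monitor_mod` defaults to ZenMonitor; Process also works because both
  # modules expose monitor/1 and accept the same destination forms.
  def monitor_all(destinations, monitor_mod \\ ZenMonitor) do
    # Every call to monitor/1 returns its own reference, even when the same
    # destination appears more than once, so the references are safe map keys.
    Map.new(destinations, fn dest -> {monitor_mod.monitor(dest), dest} end)
  end
end
```

A call such as MonitorSketch.monitor_all([pid, :local_name, {:remote_name, :"other@host"}]) would cover a pid, a locally registered name, and a {name, node} tuple. The names and node in that call are placeholders.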

ZenMonitor.demonitor/2

This is a mostly drop-in replacement for Process.demonitor/2 when it comes to processes. The first argument is the reference returned by ZenMonitor.monitor/1. It accepts a list of option atoms, but only honors the :flush option at this time. Passing the :info option is allowed but has no effect; this function always returns true.
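A minimal sketch of demonitoring with the :flush option follows. DemonitorSketch, stop_watching/2, and the monitor_mod parameter are hypothetical; the parameter defaults to ZenMonitor and can be swapped for Process, whose demonitor/2 has the same shape, to try the sketch without ZenMonitor loaded.

```elixir
defmodule DemonitorSketch do
  # Hypothetical helper, not part of ZenMonitor.
  def stop_watching(ref, monitor_mod \\ ZenMonitor) do
    # :flush asks for any :DOWN message for `ref` that is already sitting in
    # the caller's mailbox to be removed as well.
    true = monitor_mod.demonitor(ref, [:flush])
    :ok
  end
end
```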

ZenMonitor.compatibility/1

When operating in a mixed environment where some nodes are ZenMonitor compatible and some are not, it may be necessary to check the compatibility of a remote node. ZenMonitor.compatibility/1 accepts any ZenMonitor.destination and will report back one of :compatible or :incompatible for the remote's cached compatibility status.

All remotes start off as :incompatible until a positively acknowledged connection is established. See the ZenMonitor.connect/1 function for more information on connecting nodes.
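One possible use of the cached status in a mixed environment is to fall back to Process.monitor/1 when the remote is not ZenMonitor compatible. The sketch below shows that choice only; it is not a pattern prescribed by ZenMonitor. CompatSketch, monitor_with_fallback/2, and the zen_mod parameter are hypothetical names.

```elixir
defmodule CompatSketch do
  # Hypothetical helper, not part of ZenMonitor.
  # `zen_mod` defaults to ZenMonitor and must expose compatibility/1 and monitor/1.
  def monitor_with_fallback(dest, zen_mod \\ ZenMonitor) do
    case zen_mod.compatibility(dest) do
      # Remote has positively acknowledged a connection: use ZenMonitor.
      :compatible -> {:zen_monitor, zen_mod.monitor(dest)}
      # Not (yet) known to be compatible: use a regular ERTS monitor.
      :incompatible -> {:process, Process.monitor(dest)}
    end
  end
end
```

The returned tag records which mechanism owns the reference, so the caller knows which demonitor function to use and whether to expect the {:zen_monitor, reason} wrapper in the :DOWN message.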

ZenMonitor.compatibility_for_node/1

Performs the same operation as ZenMonitor.compatibility/1 but it accepts a node atom instead of a ZenMonitor.destination.

ZenMonitor.connect/1

Attempts a positive connection with the provided remote node. Connections are established by using the @gen_module's call/4 function to send a :ping message to the process registered under the atom ZenMonitor.Proxy on the remote. If this process responds with a :pong atom, then the connection is positively established and the node is marked as :compatible. Any other response or error condition (timeout, noproc, etc.) will be considered a negative acknowledgement.

ZenMonitor.connect/1 is actually a delegate for ZenMonitor.Local.Connector.connect/1. See the documentation there for more information about how connect behaves.

Handling Down Messages

Any :DOWN message receivers (most commonly GenServer.handle_info/2 callbacks) that match on the reason should be updated to include an outer {:zen_monitor, original_match} wrapper.

def handle_info({:DOWN, ref, :process, pid, :specific_reason}, state) do
  ...
end

Should be updated to the following.

def handle_info({:DOWN, ref, :process, pid, {:zen_monitor, :specific_reason}}, state) do
  ...
end
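For context, a minimal GenServer that matches the wrapped reason might look like the sketch below. DownHandlerSketch and its downs/1 helper are hypothetical names. The second handle_info/2 clause is an assumption for code that still receives unwrapped reasons from plain Process.monitor/1 monitors.

```elixir
defmodule DownHandlerSketch do
  use GenServer

  # Hypothetical example server; it only records the :DOWN reasons it sees.
  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)
  def downs(server), do: GenServer.call(server, :downs)

  @impl true
  def init(:ok), do: {:ok, []}

  @impl true
  def handle_call(:downs, _from, state), do: {:reply, Enum.reverse(state), state}

  @impl true
  # Reason delivered through ZenMonitor: match the outer {:zen_monitor, _} wrapper.
  def handle_info({:DOWN, ref, :process, _pid, {:zen_monitor, reason}}, state) do
    {:noreply, [{ref, reason} | state]}
  end

  # Unwrapped reason, e.g. from a monitor created with Process.monitor/1.
  def handle_info({:DOWN, ref, :process, _pid, reason}, state) do
    {:noreply, [{ref, reason} | state]}
  end
end
```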

Why?

ZenMonitor was developed at Discord to improve the stability of our real-time communications infrastructure. ZenMonitor improves stability in a couple of different ways.

Traffic Calming

When a process is being monitored by a large number of remote processes, that process going down can cause both the node hosting the downed process and the node hosting the monitoring processes to be suddenly flooded with a large amount of work. This is commonly referred to as a thundering herd and can overwhelm either node depending on the situation.

ZenMonitor relies on interval batching and GenStage to help calm the deluge into a throttled stream of :DOWN messages that may take more wall clock time to process but has more predictable scheduler utilization and network consumption.

Message Interspersing

In the inverse scenario, where a single process monitors a large number of remote processes, a systemic failure of many monitored processes can block the message queue. This can cause other messages being sent to the process to back up behind the :DOWN messages.

Here's what a message queue might look like if 100,000 monitors fired due to node failure.

+------------------------------------------------+
|    {:DOWN, ref, :process, pid_1, :nodedown}    |
+------------------------------------------------+
|    {:DOWN, ref, :process, pid_2, :nodedown}    |
+------------------------------------------------+
...             snip 99,996 messages           ...
+------------------------------------------------+
| {:DOWN, ref, :process, pid_99_999, :nodedown}  |
+------------------------------------------------+
| {:DOWN, ref, :process, pid_100_000, :nodedown} |
+------------------------------------------------+
|                     :work                      |
+------------------------------------------------+
|                     :work                      |
+------------------------------------------------+
|                     :work                      |
+------------------------------------------------+
...                    etc                     ...

The process has to handle all 100,000 :DOWN messages before it can get back to doing work. If the processing of a :DOWN message is non-trivial, the process can effectively appear unresponsive to callers expecting it to do :work.

ZenMonitor.Local.Dispatcher provides a configurable batch sweeping system that dispatches a fixed demand_amount of :DOWN messages every demand_interval (See the documentation for ZenMonitor.Local.Dispatcher for configuration and defaults). Using ZenMonitor the message queue would look like this.

+------------------------------------------------+
|    {:DOWN, ref, :process, pid_1, :nodedown}    |
+------------------------------------------------+
...             snip 4,998 messages           ...
+------------------------------------------------+
|  {:DOWN, ref, :process, pid_5000, :nodedown}   |
+------------------------------------------------+
|                     :work                      |
+------------------------------------------------+
...    snip messages during demand_interval    ...
+------------------------------------------------+
|                     :work                      |
+------------------------------------------------+
|  {:DOWN, ref, :process, pid_5001, :nodedown}   |
+------------------------------------------------+
...             snip 4,998 messages           ...
+------------------------------------------------+
| {:DOWN, ref, :process, pid_10_000, :nodedown}  |
+------------------------------------------------+
|                     :work                      |
+------------------------------------------------+
...    snip messages during demand_interval    ...
+------------------------------------------------+
|                     :work                      |
+------------------------------------------------+
...                    etc                     ...

This means that the process can continue processing work messages while working through more manageable batches of :DOWN messages, which improves the effective responsiveness of the process.
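The batch sweep idea can be sketched in a few lines. This is an illustration of the technique only, not the code of ZenMonitor.Local.Dispatcher. SweepSketch, sweep/2, and drain/4 are hypothetical names, and the blocking Process.sleep/1 stands in for whatever scheduling the real dispatcher uses.

```elixir
defmodule SweepSketch do
  # Hypothetical illustration of interval batching.

  # Splits off at most `demand_amount` pending messages and returns
  # {batch_to_deliver_now, still_pending}.
  def sweep(pending, demand_amount) when is_list(pending) and demand_amount > 0 do
    Enum.split(pending, demand_amount)
  end

  # Delivers every pending message, one batch per `demand_interval` milliseconds.
  def drain([], _deliver, _demand_amount, _demand_interval), do: :ok

  def drain(pending, deliver, demand_amount, demand_interval) do
    {batch, rest} = sweep(pending, demand_amount)
    Enum.each(batch, deliver)
    # Pause between batches so other messages can be handled in the gap.
    if rest != [], do: Process.sleep(demand_interval)
    drain(rest, deliver, demand_amount, demand_interval)
  end
end
```

With 100,000 pending messages and a demand_amount of 5,000, drain/4 would deliver twenty batches, leaving a demand_interval gap between consecutive batches in which the receiver can handle other work.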

Message Truncation

:DOWN messages include a reason field that can include large stack traces and GenServer state dumps. Large reasons generally don't pose an issue, but when thousands of processes are monitoring a process that generates a large reason, the cumulative effect of duplicating that reason to each monitoring process can consume all available memory on a node.

When a :DOWN message is received for dispatch to remote subscribers, the first step is to truncate the message using ZenMonitor.Truncator. See the module documentation for more information about how truncation is performed and what configuration options are supported.

This prevents the scenario where a single process with a large stack trace or large state gets amplified on the receiving node and consumes a large amount of memory.
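To make the idea of truncating a reason concrete, here is a small sketch of one possible approach: cap binary size, list length, and nesting depth. The limits, the module name TruncateSketch, and the rules themselves are assumptions made for illustration; ZenMonitor.Truncator's actual rules and options are described in its module documentation and may differ.

```elixir
defmodule TruncateSketch do
  # Hypothetical limits chosen for the example.
  @max_bytes 64
  @max_items 5

  # Returns a bounded copy of `term`, descending at most `depth` levels.
  def truncate(term, depth \\ 3)

  # Nesting budget exhausted: replace the remainder with a marker.
  def truncate(_term, 0), do: :truncated

  # Long binaries keep only their first @max_bytes bytes (this may cut a
  # multi-byte UTF-8 character in half).
  def truncate(bin, _depth) when is_binary(bin) and byte_size(bin) > @max_bytes do
    binary_part(bin, 0, @max_bytes) <> "..."
  end

  # Lists keep only their first @max_items elements (proper lists only).
  def truncate(list, depth) when is_list(list) do
    list |> Enum.take(@max_items) |> Enum.map(&truncate(&1, depth - 1))
  end

  # Tuples keep their arity; each element is truncated one level deeper.
  def truncate(tuple, depth) when is_tuple(tuple) do
    tuple |> Tuple.to_list() |> Enum.map(&truncate(&1, depth - 1)) |> List.to_tuple()
  end

  # Atoms, numbers, pids, maps and other terms pass through unchanged here.
  def truncate(other, _depth), do: other
end
```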

Design

ZenMonitor is constructed of two cooperating systems, the Local ZenMonitor System and the Proxy ZenMonitor System. When a process wishes to monitor a remote process, it should inform the Local ZenMonitor System which will efficiently dispatch the monitoring request to the remote node's Proxy ZenMonitor System.

Local ZenMonitor System

The Local ZenMonitor System is composed of a few processes, which are managed by the ZenMonitor.Local.Supervisor. The processes that comprise the Local ZenMonitor System are described in detail in the following sections.

ZenMonitor.Local

ZenMonitor.Local is responsible for accepting monitoring and demonitoring requests from local processes. It will send these requests to the Connector processes for efficient transmission to the responsible ZenMonitor.Proxy processes.

When a monitored process dies, the ZenMonitor.Proxy will send this information in a summary message to the ZenMonitor.Local.Connector process, which will then send down_dispatches to ZenMonitor.Local for eventual delivery by the ZenMonitor.Local.Dispatcher.

ZenMonitor.Local is also responsible for monitoring the local interested process and performing clean-up if that process crashes for any reason. This prevents the Local ZenMonitor System from leaking memory.

ZenMonitor.Local.Tables

This is a simple process that is responsible for owning shared ETS tables used by various parts of the Local ZenMonitor System.

It maintains two tables, ZenMonitor.Local.Tables.Nodes and ZenMonitor.Local.Tables.References. These tables are public and are normally written to and read from by the ZenMonitor.Local and ZenMonitor.Local.Connector processes.

ZenMonitor.Local.Connector

ZenMonitor.Local.Connector is responsible for batching monitoring requests into summary requests for the remote ZenMonitor.Proxy. The Connector handles the actual distribution connection to the remote ZenMonitor.Proxy including dealing with incompatible and down nodes.

When processes go down on the remote node, the Proxy ZenMonitor System will report summaries of these down processes to the corresponding ZenMonitor.Local.Connector.

There will be one ZenMonitor.Local.Connector per remote node with monitored processes.

ZenMonitor.Local.Dispatcher

When a remote node or remote processes fail, messages will be enqueued for delivery. The ZenMonitor.Local.Dispatcher is responsible for processing these enqueued messages at a steady and controlled rate.

Proxy ZenMonitor System

The Proxy ZenMonitor System is composed of a few processes, which are managed by the ZenMonitor.Proxy.Supervisor. The processes that comprise the Proxy ZenMonitor System are described in detail in the following sections.

ZenMonitor.Proxy

ZenMonitor.Proxy is responsible for handling subscription requests from the Local ZenMonitor System and for maintaining the ERTS Process Monitors on the processes local to the remote node.

ZenMonitor.Proxy is designed to be efficient with local monitors and will guarantee that for any local process there is, at most, one ERTS monitor, no matter how many remote processes and remote nodes are interested in monitoring that process.

When a local process goes down ZenMonitor.Proxy will enqueue a new death certificate to the ZenMonitor.Proxy.Batcher processes that correspond to the interested remotes.
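The "at most one ERTS monitor per local process" guarantee can be pictured with a small bookkeeping sketch. This uses a plain map for clarity, whereas ZenMonitor.Proxy keeps its subscribers in an ETS table. ProxySketch, subscribe/3, and pop_subscribers/2 are hypothetical names.

```elixir
defmodule ProxySketch do
  # Hypothetical bookkeeping, not ZenMonitor.Proxy's implementation.
  # State shape: %{monitored_pid => {erts_monitor_ref, MapSet of subscribers}}

  def subscribe(state, pid, subscriber) do
    case state do
      %{^pid => {ref, subs}} ->
        # Already monitored: reuse the single ERTS monitor, add the subscriber.
        Map.put(state, pid, {ref, MapSet.put(subs, subscriber)})

      _ ->
        # First subscriber for this pid: establish the one ERTS monitor.
        Map.put(state, pid, {Process.monitor(pid), MapSet.new([subscriber])})
    end
  end

  # Called when the :DOWN message for `pid` arrives. Returns the subscribers
  # that need a death certificate and the state without `pid`.
  def pop_subscribers(state, pid) do
    case Map.pop(state, pid) do
      {{_ref, subs}, rest} -> {MapSet.to_list(subs), rest}
      {nil, rest} -> {[], rest}
    end
  end
end
```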

ZenMonitor.Proxy.Tables

This is a simple process that is responsible for owning shared ETS tables used by various parts of the Proxy ZenMonitor System.

It maintains a single table, ZenMonitor.Proxy.Tables.Subscribers. This table is used by both the ZenMonitor.Proxy and ZenMonitor.Proxy.Batcher processes.

ZenMonitor.Proxy.Batcher

This process has two primary responsibilities: collecting and summarizing death certificates, and monitoring the remote process.

For every remote ZenMonitor.Local.Connector that is interested in monitoring processes on this node, a corresponding ZenMonitor.Proxy.Batcher is spawned that will collect and ultimately deliver death certificates. The ZenMonitor.Proxy.Batcher will also monitor the remote ZenMonitor.Local.Connector and clean up after it if it goes down for any reason.
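The two responsibilities can be sketched as a small GenServer that buffers death certificates, flushes them as one summary per interval, and stops when its subscriber goes down. The module name, the {:dead_summary, certs} message shape, and the interval handling are all assumptions made for illustration, not ZenMonitor.Proxy.Batcher's actual protocol.

```elixir
defmodule BatcherSketch do
  use GenServer

  # Hypothetical names and message shapes throughout.
  def start_link(subscriber, interval_ms) do
    GenServer.start_link(__MODULE__, {subscriber, interval_ms})
  end

  def enqueue(batcher, pid, reason), do: GenServer.cast(batcher, {:enqueue, pid, reason})

  @impl true
  def init({subscriber, interval_ms}) do
    # Monitor the subscriber so the batcher can clean up when it goes away.
    Process.monitor(subscriber)
    Process.send_after(self(), :flush, interval_ms)
    {:ok, %{subscriber: subscriber, interval: interval_ms, certs: []}}
  end

  @impl true
  def handle_cast({:enqueue, pid, reason}, state) do
    {:noreply, %{state | certs: [{pid, reason} | state.certs]}}
  end

  @impl true
  def handle_info(:flush, state) do
    # One summary message per interval instead of one message per death.
    if state.certs != [] do
      send(state.subscriber, {:dead_summary, Enum.reverse(state.certs)})
    end

    Process.send_after(self(), :flush, state.interval)
    {:noreply, %{state | certs: []}}
  end

  # The subscriber itself went down: nothing left to deliver to.
  def handle_info({:DOWN, _ref, :process, sub, _reason}, %{subscriber: sub} = state) do
    {:stop, :normal, state}
  end
end
```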

Running a Compatible Node

ZenMonitor ships with an Application, ZenMonitor.Application, which will start the overall supervisor, ZenMonitor.Supervisor. This creates a supervision tree as outlined below.

                                                                            -------------------------
                                                                      +----| ZenMonitor.Local.Tables |
                                                                      |     -------------------------
                                                                      |
                                                                      |     ------------------
                                                                      +----| ZenMonitor.Local |
                                    -----------------------------     |     ------------------
                              +----| ZenMonitor.Local.Supervisor |----|
                              |     -----------------------------     |     -------------       ----------------------------
                              |                                       +----| GenRegistry |--N--| ZenMonitor.Local.Connector |
                              |                                       |     -------------       ----------------------------
                              |                                       |
                              |                                       |     -----------------------------
                              |                                       +----| ZenMonitor.Local.Dispatcher |
                              |                                             -----------------------------
  -----------------------     |
 | ZenMonitor.Supervisor |----|
  -----------------------     |                                             -------------------------
                              |                                       +----| ZenMonitor.Proxy.Tables |
                              |                                       |     -------------------------
                              |                                       |
                              |     -----------------------------     |     ------------------
                              +----| ZenMonitor.Proxy.Supervisor |----+----| ZenMonitor.Proxy |
                                    -----------------------------     |     ------------------
                                                                      |
                                                                      |     -------------       --------------------------
                                                                      +----| GenRegistry |--M--| ZenMonitor.Proxy.Batcher |
                                                                            -------------       --------------------------
