jvmquake

A JVMTI agent that attaches to your JVM and automatically signals and kills it when the program has become unstable.

The name comes from "jvm earthquake" (a play itself on hotspot).

This project is heavily inspired by airlift/jvmkill, written by David Phillips, but adds a GC instability detection algorithm for when a JVM is unstable but not quite dead yet (aka "GC spirals of death").

Production Quality

This agent has a thorough test suite and error handling, and has been demonstrated in production to be superior to the built-in JVM options. Netflix currently (2019-11-11) runs this software attached to a very large number of Cassandra and Elasticsearch JVMs.

A detailed motivation is below. To just start using jvmquake, skip to Building and Usage for how to build and use this agent.

Motivation

Java applications, especially databases such as Elasticsearch and Cassandra, can easily enter GC spirals of death, resulting in either an eventual OOM or Concurrent Mode Failures (aka "CMF" in CMS parlance, although G1 has similar issues with frequent mixed mode collections). Concurrent mode failures, where the old gen collector runs frequently and expends a lot of CPU but still reclaims enough memory that the application does not hit a full OOM, are particularly pernicious as they appear as 10-30s "partitions" (duration is proportional to heap size) which repeatedly form and heal ...

This grey failure mode wreaks havoc on distributed systems. In the case of databases it can lead to degraded performance or even data corruption. General JVM applications that use distributed locks to enter a critical section may make incorrect decisions under the assumption they have a lock when they in fact do not (e.g. if the application pauses for 40s and then continues executing, assuming it still holds a lock in ZooKeeper).

As pathological heap situations are so problematic, the JVM has various flags to try to address these issues:

  • OnOutOfMemoryError: Commonly used with kill -9 %p. This option sometimes works but most often results in no action, especially when the JVM is out of file descriptors and can't execute the command at all. As of Java 8u92 there is a better option in the ExitOnOutOfMemoryError option below. Furthermore, this option does not handle excessive GC.
  • ExitOnOutOfMemoryError and CrashOnOutOfMemoryError: Both options were added as part of JDK-8138745. Both work great for dealing with running out of memory, but do not handle other edge cases such as running out of threads. Naturally, they also do nothing when you are in the "grey" failure mode of CMF.

There are also some options that are supposed to control GC overhead:

  • GCHeapFreeLimit, GCTimeLimit and +UseGCOverheadLimit. These options are supposed to cause an OOM in the case where we are not collecting enough memory, or are spending too much time in GC. However in practice I've never been able to get these to work well, and GCOverheadLimit is afaik only supported in CMS.

TLDR: In my experience these JVM flags are hard to tune, only sometimes work (if they work at all), and are often limited to a subset of JVMs or collectors.

Premise of jvmquake

jvmquake is designed with the following guiding principles in mind:

  1. If my JVM becomes totally unusable (OOM, out of threads, etc), I want it to die.
  2. If my JVM spends excessive time garbage collecting, I want it to die.
  3. I may want to be able to debug why my JVM ran out of memory (e.g. heap dumps or core dumps). I may want jvmquake to signal me that the JVM is in trouble before it kills it so I can start gathering additional diagnostics.
  4. This should work on any JVM (Java 6, Java 7, Java 8, Java 11, etc.).

These principles are in alignment with Crash Only Software (background) which implores us to crash when we encounter bugs instead of limping along.

Knobs and Options

jvmquake has three options passed as comma delimited integers <threshold>,<runtime_weight>,<action>:

  • threshold (default: 30): the maximum GC "deficit" which can be accumulated before jvmquake takes action, specified in seconds.
  • runtime_weight (default: 5): the factor by which to multiply running JVM time when weighing it against GCing time. "Deficit" is accumulated as gc_time - runtime * runtime_weight, and is compared against threshold to determine whether to take action.
  • action (default: 0): what action should be taken when threshold is exceeded. If zero, jvmquake attempts to produce an OOM within the JVM (allowing standard OOM handling such as HeapDumpOnOutOfMemoryError to trigger). If nonzero, jvmquake raises that signal number as an OS-level signal. Regardless of the action, the JVM is then forcibly killed via a SIGKILL.

In addition, jvmquake supports keyword arguments passed as comma separated key=value pairs in a fourth argument, so int,int,int,key1=value1,key2=value2. The currently supported key value pairs are:

  • warn (type: int, default: maxint): an amount of GC "deficit" (analogous to threshold) which will cause jvmquake to touch a file (see touch) before it kills the JVM. The default setting is not to warn.
  • touch (type: string, default: /tmp/jvmquake_warn_gc): the file path that jvmquake should open (creating it if necessary) and whose access and modification times it should update when there is more than warn GC "deficit".
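
To make the option format above concrete, here is a minimal Python sketch of how such a string could be parsed. It is illustrative only: the real agent is written in C, and the names QuakeOptions and parse_options are hypothetical. How the agent itself treats partial or malformed input is not described here, so the error handling below is an assumption.

```python
import sys
from dataclasses import dataclass


@dataclass
class QuakeOptions:
    # Defaults mirror the values documented above
    threshold: int = 30
    runtime_weight: int = 5
    action: int = 0
    warn: int = sys.maxsize  # stands in for "maxint", i.e. never warn
    touch: str = "/tmp/jvmquake_warn_gc"


def parse_options(spec: str) -> QuakeOptions:
    """Parse '<threshold>,<runtime_weight>,<action>[,key=value...]'."""
    opts = QuakeOptions()
    positional = ["threshold", "runtime_weight", "action"]
    fields = [field for field in spec.split(",") if field]
    for index, field in enumerate(fields):
        if "=" in field:
            # Keyword arguments: only warn and touch are documented
            key, value = field.split("=", 1)
            if key == "warn":
                opts.warn = int(value)
            elif key == "touch":
                opts.touch = value
            else:
                raise ValueError(f"unknown option: {key}")
        elif index < len(positional):
            # Positional integers, in the documented order
            setattr(opts, positional[index], int(field))
        else:
            raise ValueError(f"unexpected positional value: {field}")
    return opts


print(parse_options("30,1,9,warn=1,touch=/tmp/jvmquake"))
```

An empty string yields the documented defaults, which matches loading the agent without any options.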

Algorithm Details

To achieve our goal, we build on jvmkill. In addition to dying when we see a ResourceExhausted event, jvmquake keeps track of every GC entrance and exit that pauses the application, using GarbageCollectionStart and GarbageCollectionFinish. jvmquake then maintains a token bucket to keep track of how much time is spent GCing relative to running application code. Note that per the JVMTI spec these events only track the stop-the-world pausing phases of collections. The following pseudocode is essentially all of jvmquake:

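import time

# The bucket for keeping track of relative running and non-running time
token_bucket: float = 0.0
# The amount of weight to give running seconds over GCing seconds. This defines
# our expected application throughput
runtime_weight: int = 5
# The amount of time (in seconds) that we must exceed the expected throughput by
# before triggering the signal and death actions
gc_threshold: int = 30

# Time bookkeeping
last_gc_start: float = time.monotonic()
last_gc_end: float = time.monotonic()

def take_action() -> None:
    # Placeholder: the agent triggers an OOM or raises a signal, then SIGKILLs
    raise SystemExit("GC deficit exceeded threshold")

def on_gc_start() -> None:
    # A pausing collection begins: time spent running pays down the bucket
    global token_bucket, last_gc_start
    last_gc_start = time.monotonic()
    time_running = last_gc_start - last_gc_end
    token_bucket = max(0.0, token_bucket - (time_running * runtime_weight))

def on_gc_end() -> None:
    # The collection finishes: time spent paused in GC fills the bucket
    global token_bucket, last_gc_end
    last_gc_end = time.monotonic()
    time_gcing = last_gc_end - last_gc_start
    token_bucket += time_gcing

    if token_bucket > gc_threshold:
        take_action()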

The warn and touch options just touch a file (specified by touch) when the token_bucket exceeds the warning gc threshold instead of the kill threshold.
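
To see how the bucket behaves over time, the following Python sketch replays a list of (seconds running, seconds in GC) cycles through the same bookkeeping as the pseudocode above. It is a simulation for intuition, not the agent's code; the simulate function and the two example workloads are made up for illustration.

```python
def simulate(cycles, runtime_weight=5, threshold=30, warn=None):
    """Replay (seconds_running, seconds_in_gc) pairs through the bucket.

    Returns (index_of_warn_cycle, index_of_kill_cycle); either may be None.
    """
    token_bucket = 0.0
    warned_at = None
    for index, (time_running, time_gcing) in enumerate(cycles):
        # GC start: running time pays down the deficit, floored at zero
        token_bucket = max(0.0, token_bucket - time_running * runtime_weight)
        # GC end: pause time adds to the deficit
        token_bucket += time_gcing
        if warn is not None and warned_at is None and token_bucket > warn:
            warned_at = index
        if token_bucket > threshold:
            return warned_at, index
    return warned_at, None


# A healthy JVM: 10s of running between 1s pauses never accumulates deficit
healthy = [(10.0, 1.0)] * 100
# A spiralling JVM: only 0.5s of running between 4s pauses
spiral = [(0.5, 4.0)] * 100

print(simulate(healthy))          # (None, None)
print(simulate(spiral, warn=10))  # (5, 18)
```

In the spiralling case each cycle adds a net 1.5s of deficit, so the warn level of 10 is crossed on cycle 5 and the default threshold of 30 on cycle 18 (counting from zero).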

Building and Usage

As jvmquake is a JVMTI C agent (so that it lives outside the heap and cannot be affected by GC behavior), you must compile it before using it against your JVM. You can either do this on the machine running the Java project or, more commonly, in an external build that generates the .so or a package such as a .deb. The generated .so depends only on your architecture and libc, and should work with any JDK newer than the one you compiled it with on the same platform (so e.g. linux-x86_64 will work on all x86_64 linux systems).

# Compile jvmquake against the JVM the application is using. If you do not
# provide the path, the environment variable JAVA_HOME is used instead

make JAVA_HOME=/path/to/jvm

For example if the Oracle Java 8 JVM is located at /usr/lib/jvm/java-8-oracle:

make JAVA_HOME=/usr/lib/jvm/java-8-oracle

The agent is now available at build/libjvmquake-<platform>.so. For example, on a linux machine you should get libjvmquake-linux-x86_64.so and on mac you might see libjvmquake-darwin-x86_64.so.

Note: A libjvmquake.so built from source like this is portable to all JVMs that implement the same JVMTI specification. In practice I find the same .so works fine with Java 8, 9, 11 and I imagine it will work until Java changes the spec.

See Testing for the full set of platforms that we test against.

How to Use the Agent

Once you have the agent library, run your java program with the -agentpath or -agentlib option to load it.

java -agentpath:/path/to/libjvmquake.so <your java program here>

If you have installed the .so to /usr/lib (for example using a debian package) you can just do java -agentpath:libjvmquake.so.

The default settings are 30 seconds of GC deficit with a 1:5 gc:running time weight, and the default action is to trigger an in-JVM OOM. These defaults are reasonable for a latency critical java application.
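
As a rough guide to what a given weight means in practice, the deficit formula can be rearranged into a break-even GC fraction. The Python sketch below is back-of-the-envelope arithmetic derived from gc_time - runtime * runtime_weight; both function names are hypothetical, and the steady-state estimate ignores the fact that the agent only checks the bucket at the end of each pause.

```python
def break_even_gc_fraction(runtime_weight: float) -> float:
    """Fraction of wall-clock time in GC pauses above which deficit grows.

    Deficit grows when gc_time > running_time * runtime_weight, which
    rearranges to gc_time / (gc_time + running_time) > w / (1 + w).
    """
    return runtime_weight / (1.0 + runtime_weight)


def seconds_until_threshold(gc_fraction, runtime_weight=5, threshold=30):
    """Approximate wall-clock seconds until the threshold is crossed at a
    steady GC fraction. Returns None if the deficit never grows."""
    growth_per_second = gc_fraction - (1.0 - gc_fraction) * runtime_weight
    if growth_per_second <= 0:
        return None
    return threshold / growth_per_second


print(break_even_gc_fraction(5))     # about 0.83 with the default weight
print(seconds_until_threshold(1.0))  # 30.0 when the JVM does nothing but GC
print(seconds_until_threshold(0.5))  # None: deficit never accumulates
```

With the default weight of 5, deficit only accumulates while the JVM spends more than roughly five sixths of its time in pausing collections; with a weight of 1 the break-even point is one half.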

If you want different settings you can pass options per the option specification.

java -agentpath:/path/to/libjvmquake.so=<options> <your java program here>

Some examples:

If you want to cause a java level OOM when the program exceeds 30 seconds of deficit where running time is equally weighted to gc time:

java -agentpath:/path/to/libjvmquake.so=30,1,0 <your java program here>

If you want to trigger an OS core dump (signal 6, SIGABRT) and then die when the program exceeds 30 seconds of deficit where running time is equally weighted to gc time:

java -agentpath:/path/to/libjvmquake.so=30,1,6 <your java program here>

If you want to trigger a SIGKILL as soon as the threshold is exceeded, without any form of diagnostics:

java -agentpath:/path/to/libjvmquake.so=30,1,9 <your java program here>

If you want to trigger a SIGTERM without any form of diagnostics:

java -agentpath:/path/to/libjvmquake.so=30,1,15 <your java program here>

If you want to cause a java level OOM when the program exceeds 60 seconds of deficit where running time is 10:1 weighted to gc time:

java -agentpath:/path/to/libjvmquake.so=60,10,0 <your java program here>

If you want to trigger a SIGKILL immediately after a 30s GC deficit accrues and touch /tmp/jvmquake once more than 1s of GC deficit accrues (presumably to inform a watching process to fire off some kind of profiler or other diagnostics):

java -agentpath:/path/to/libjvmquake.so=30,1,9,warn=1,touch=/tmp/jvmquake <your java program here>
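
The watching process mentioned above is not part of jvmquake. As one possible shape for it, here is a minimal Python sketch that polls the touch file and calls a callback whenever its modification time changes. The function names, the polling approach and the callback are all assumptions made for illustration.

```python
import os
import time


def warn_file_mtime(path):
    """Return the file's modification time, or None if it does not exist yet."""
    try:
        return os.stat(path).st_mtime
    except FileNotFoundError:
        return None


def watch(path, on_warn, poll_seconds=1.0, max_polls=None):
    """Poll `path` and call `on_warn(mtime)` each time its mtime changes."""
    last_seen = warn_file_mtime(path)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(poll_seconds)
        polls += 1
        current = warn_file_mtime(path)
        if current is not None and current != last_seen:
            # The file was created or touched since the last poll
            last_seen = current
            on_warn(current)
```

A caller might run watch("/tmp/jvmquake", start_profiler), where start_profiler is whatever diagnostic hook suits your environment. Leaving max_polls unset makes the loop run until the process is stopped.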

Testing

jvmquake comes with a test suite of OOM conditions (running out of memory, running out of threads, GCing too much, etc.) which you can run if you have a JDK, tox and python3 available:

# Run the test suite which uses tox, pytest, and plumbum under the hood
# to run jvmquake through numerous difficult failure modes
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/ make test

If you have docker you can also run specific environment tests that bundle all dependencies for a platform into a single dockerized build:

# Run the Ubuntu bionic openjdk8 test suite via Docker
make test_bionic_openjdk8

There is also a test suite in tests/test_java_opts.py which shows that the standard JVM options do not work to remediate the situations jvmquake handles.

Automated Tests

We currently test every commit and the released .so generated with OpenJDK 8 against the following platforms:

  • Ubuntu Xenial with OpenJDK8
  • Ubuntu Bionic with OpenJDK8
  • Ubuntu Bionic with OpenJDK11
  • Ubuntu Bionic with Zulu8
  • Ubuntu Bionic with Zulu11
  • Ubuntu Focal with OpenJDK8
  • Ubuntu Focal with OpenJDK11
  • Centos7 with OpenJDK8
