  • Stars: 289
  • Rank: 143,419 (Top 3 %)
  • Language: TypeScript
  • License: Other
  • Created over 4 years ago
  • Updated 8 months ago

Repository Details

A Kubernetes node connectivity monitoring tool

kconmon - Monitoring connectivity between your Kubernetes nodes

A Kubernetes node connectivity tool that performs frequent tests (TCP, UDP and DNS) and exposes Prometheus metrics enriched with the node name and locality information (such as zone), enabling you to correlate issues between availability zones or nodes.

The idea is this information supplements any other L7 monitoring you use, such as Istio observability or Kube State Metrics, to help you get to the root cause of a problem faster.

It's really performant considering the number of tests it runs: on my clusters of 75 nodes, the agents have a mere 60m CPU / 40MB RAM resource request.

Once you've got it up and going, you can plot some pretty dashboards like this:

[Image: example Grafana dashboard]

PS. I've included a sample dashboard here to get you going

Known Issues:

  • It's super, mega pre-alpha, the product of a weekend's experimentation, so don't expect it to be perfect. I plan to improve it but wanted to get something out there to people who wanted it.
  • It's written in Node.js, which means the Docker image is 130MB. That's not huge, but it isn't golang small either.
  • If you've got nodes coming up and down frequently, eventual consistency means that you might get some test failures while an agent is testing a node that's gone (but has yet to receive an updated agent list). I plan to tackle this with push agent updates.

Architecture

The application consists of two components.

Agent

The agent runs as a DaemonSet on Kubernetes clusters, and requires minimal permissions to run. The agent's purpose is to periodically run tests against the other agents, and expose the results as metrics.

The agent also spawns with an initContainer, which sets some sysctl TCP optimisations. You can disable this behaviour in the Helm values file.

Controller

In order to discover other agents, and enrich the agent information with metadata about the node and availability zone, the controller constantly watches the Kubernetes API and maintains the current state in memory. The agents connect to the controller when they start, to get their own metadata, and then every 5 seconds in order to get an up-to-date agent list.
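As a rough illustration of what an agent might do with each refreshed list, the sketch below compares the previous list with the one just fetched from the controller. The `Agent` shape and the `diffAgents` helper are hypothetical names chosen for this example, not kconmon's actual types.

```typescript
// Hypothetical shape of the metadata the controller hands out.
// The real kconmon payload may differ.
interface Agent {
  name: string
  nodeName: string
  zone: string
  ip: string
}

// Compare the previous agent list with a freshly fetched one and report
// which agents appeared and which disappeared since the last refresh.
function diffAgents(
  previous: Agent[],
  current: Agent[]
): { added: Agent[]; removed: Agent[] } {
  const previousNames = new Set(previous.map((agent) => agent.name))
  const currentNames = new Set(current.map((agent) => agent.name))
  return {
    added: current.filter((agent) => !previousNames.has(agent.name)),
    removed: previous.filter((agent) => !currentNames.has(agent.name))
  }
}
```

An agent that drops `removed` entries from its test targets as soon as they vanish from the list only keeps testing a dead node for, at most, one refresh interval.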

NOTE: Your cluster needs RBAC enabled as the controller uses in-cluster service-account authentication with the kubernetes master.

Testing

kconmon runs a variety of tests, and exposes the results as Prometheus metrics enriched with the node and locality information. The interval is configurable in the Helm chart config, and is subject to a 50-500ms jitter to spread the load.
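To make the jitter concrete, here is a minimal sketch of how a test loop could pick its next delay. It assumes the jitter is added on top of the configured interval and drawn uniformly; the constant and function names are illustrative, and kconmon's own scheduling may differ.

```typescript
const JITTER_MIN_MS = 50
const JITTER_MAX_MS = 500

// Return the configured interval plus a random jitter in [50, 500] ms.
// `random` is injectable so the behaviour can be checked deterministically.
function nextDelayMs(
  intervalMs: number,
  random: () => number = Math.random
): number {
  const span = JITTER_MAX_MS - JITTER_MIN_MS + 1
  const jitter = JITTER_MIN_MS + Math.floor(random() * span)
  return intervalMs + jitter
}
```

With a 5 second interval, each run would then be scheduled somewhere between 5050ms and 5500ms after the previous one, so agents that started together drift apart instead of testing in lockstep.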

UDP Testing

kconmon agents by default perform 5 x 4-byte UDP packet tests against every other agent, every 5 seconds. Each test waits for a response from the destination agent. The RTT timeout is 250ms; anything longer than that and we consider the packet lost in the abyss. The metrics output from UDP tests are:

  • GAUGE kconmon_udp_duration_milliseconds: The total RTT from sending the packet to receiving a response
  • GAUGE kconmon_udp_duration_variance_milliseconds: The variance between the slowest and the fastest packet
  • GAUGE kconmon_udp_loss: The percentage of requests from the batch that failed
  • COUNTER kconmon_udp_results_total: A Counter of test results, pass and fail
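The sketch below shows one way the gauge values above could be derived from a single batch of round-trip times. It assumes the duration is reported as the mean RTT of the packets that came back, and that a missing reply is represented as `null`; both are assumptions for illustration, as are the names `summariseUdpBatch` and `UdpBatchSummary`.

```typescript
const UDP_TIMEOUT_MS = 250

interface UdpBatchSummary {
  durationMs: number | null // mean RTT of the packets that returned in time
  varianceMs: number | null // slowest RTT minus fastest RTT
  lossPercent: number // share of the batch that was lost
}

// Summarise one batch of UDP round trips. A `null` entry means no reply was
// seen; a reply slower than the timeout is also counted as lost.
function summariseUdpBatch(rtts: Array<number | null>): UdpBatchSummary {
  const received = rtts.filter(
    (rtt): rtt is number => rtt !== null && rtt <= UDP_TIMEOUT_MS
  )
  const lost = rtts.length - received.length
  const lossPercent = rtts.length === 0 ? 0 : (lost * 100) / rtts.length
  if (received.length === 0) {
    return { durationMs: null, varianceMs: null, lossPercent }
  }
  const total = received.reduce((sum, rtt) => sum + rtt, 0)
  return {
    durationMs: total / received.length,
    varianceMs: Math.max(...received) - Math.min(...received),
    lossPercent
  }
}
```

For a batch of five packets where one reply never arrives and one takes 300ms, this reports 40% loss and computes the duration and variance from the remaining three.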

TCP Testing

kconmon agents will perform a single HTTP GET request against every other agent, every 5 seconds. Each connection is terminated with Connection: close and Nagle's Algorithm is disabled to ensure consistency across tests.

The metrics output from TCP tests are:

  • GAUGE kconmon_tcp_connect_milliseconds: The duration from socket assignment to successful TCP connection of the last test run
  • GAUGE kconmon_tcp_duration_milliseconds: The total RTT of the request
  • COUNTER kconmon_tcp_results_total: A Counter of test results, pass and fail
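A minimal sketch of such a test using Node's built-in `http` module is shown below. It is not kconmon's actual implementation: the `tcpTest` name, the `TcpResult` shape, the request path and the use of `Date.now()` for timing are all assumptions made for illustration.

```typescript
import * as http from "http"

interface TcpResult {
  connectMs: number // socket assignment to TCP connect, -1 if never connected
  durationMs: number // whole request, from start to end of response
  ok: boolean
}

// Perform one HTTP GET on a fresh connection and time it.
function tcpTest(host: string, port: number, path = "/"): Promise<TcpResult> {
  return new Promise((resolve) => {
    const started = Date.now()
    let connectMs = -1
    const req = http.request(
      {
        host,
        port,
        path,
        method: "GET",
        agent: false, // no connection pooling: every test opens a new socket
        headers: { Connection: "close" }
      },
      (res) => {
        res.resume() // drain the body so that 'end' fires
        res.on("end", () =>
          resolve({
            connectMs,
            durationMs: Date.now() - started,
            ok: res.statusCode === 200
          })
        )
      }
    )
    req.setNoDelay(true) // disable Nagle's algorithm on the underlying socket
    req.on("socket", (socket) => {
      const assigned = Date.now()
      socket.once("connect", () => {
        connectMs = Date.now() - assigned
      })
    })
    req.on("error", () =>
      resolve({ connectMs, durationMs: Date.now() - started, ok: false })
    )
    req.end()
  })
}
```

Failures resolve with `ok: false` rather than rejecting, so a caller can increment a pass or fail counter from a single code path.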

DNS Testing

kconmon agents will perform DNS tests by default every 5 seconds. It's a good idea to have tests for a variety of different resolvers (e.g. kube-dns, public resolvers, etc.).

The metrics output from DNS tests are:

  • GAUGE kconmon_dns_duration_milliseconds: The duration of the last test run
  • COUNTER kconmon_dns_results_total: A Counter of test results, pass and fail
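The following sketch shows how a single DNS test could be timed and classified. The resolver is passed in as a function so the example stays independent of any particular DNS library; `dnsTest`, `DnsResult` and the `Resolve` type are hypothetical names, and treating an empty answer as a failure is an assumption.

```typescript
// Any function that resolves a hostname to a list of addresses.
type Resolve = (hostname: string) => Promise<string[]>

interface DnsResult {
  host: string
  durationMs: number
  result: "pass" | "fail"
}

// Time one lookup. An error or an empty answer is recorded as a failure.
async function dnsTest(resolve: Resolve, host: string): Promise<DnsResult> {
  const started = Date.now()
  try {
    const addresses = await resolve(host)
    return {
      host,
      durationMs: Date.now() - started,
      result: addresses.length > 0 ? "pass" : "fail"
    }
  } catch {
    return { host, durationMs: Date.now() - started, result: "fail" }
  }
}
```

With Node's built-in `dns` module, `resolve` could wrap the `resolve4` method of a `dns.promises.Resolver` instance, after calling `setServers` on it to target a specific resolver such as kube-dns or a public one.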

Prometheus Metrics

The agents expose a metric endpoint on :8080/metrics, which you'll need to configure Prometheus to scrape. Here is an example scrape config:

- job_name: 'kconmon'
  honor_labels: true
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names:
      - kconmon
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app, __meta_kubernetes_pod_label_component]
    action: keep
    regex: "(kconmon;agent)"
  - source_labels: [__address__]
    action: replace
    regex: ([^:]+)(?::\d+)?
    replacement: $1:8080
    target_label: __address__
  metric_relabel_configs:
  - regex: "(instance|pod)"
    action: labeldrop
  - source_labels: [__name__]
    regex: "(kconmon_.*)"
    action: keep
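
If the relabelling rules look opaque, the sketch below mimics the first two in plain TypeScript. It relies on two documented Prometheus behaviours: multiple source labels are joined with ";" before matching, and relabel regexes are fully anchored. The function names are made up for this example, and JavaScript's regex engine stands in for Prometheus's RE2.

```typescript
// Mimics the `keep` rule: the app and component pod labels are joined with
// ";" and must match "(kconmon;agent)" in full.
function keepTarget(appLabel: string, componentLabel: string): boolean {
  const joined = [appLabel, componentLabel].join(";")
  return /^(?:(kconmon;agent))$/.test(joined)
}

// Mimics the `replace` rule on __address__: strip any existing port and
// point the scrape at 8080. Addresses that do not match are left unchanged.
function rewriteAddress(address: string): string {
  const anchored = /^(?:([^:]+)(?::\d+)?)$/
  return anchored.test(address)
    ? address.replace(anchored, "$1:8080")
    : address
}
```

So a pod discovered at 10.0.0.5:9090 with labels app=kconmon and component=agent is kept and scraped on 10.0.0.5:8080, while the controller pod is dropped by the keep rule.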

Your other option, if you're using the Prometheus Operator, is to install the Helm chart with --set prometheus.enableServiceMonitor=true. This will create a Service and a ServiceMonitor for you.

Alerting

You could configure some alerts too, like this one which fires when we have consistent TCP test failures between zones for 2 minutes:

groups:
- name: kconmon.alerting-rules
  rules:
  - alert: TCPInterZoneTestFailure
    expr: |
      sum(increase(kconmon_tcp_results_total{result="fail"}[1m])) by (source_zone, destination_zone) > 0
    for: 2m
    labels:
      severity: warning
      source: '{{ "{{" }}$labels.source_zone{{ "}}" }}'
    annotations:
      instance: '{{ "{{" }}$labels.destination_zone{{ "}}" }}'
      description: >-
        TCP Test Failures detected between one or more zones
      summary: Inter Zone L7 Test Failure

Deployment

The easiest way to install kconmon is with Helm. Head over to the releases page to download the latest chart. Check out the values.yaml for all the available configuration options.
