Advanced system monitor & process supervisor for Linux

System & Process Supervisor for Linux

http://toonclips.com/design/788

Introduction

watchdogd(8) is an advanced system and process supervisor daemon, primarily intended for embedded Linux and server systems. By default it periodically kicks the system watchdog timer (WDT) to prevent it from resetting the system. In its more advanced guise it monitors critical system resources, supervises the heartbeat of processes, records deadline transgressions, and initiates a controlled reset if needed.

When a system starts up, watchdogd determines the reset cause by querying the kernel. After a system reset (as opposed to a power loss), the reset reason has already been saved to a file for later analysis by an operator or network management system (NMS). This information can in turn be used to put the system in an operational safe state, or a non-operational safe state.

What is a watchdog timer?

Most server and laptop motherboards today come equipped with a watchdog timer (WDT). It is a small timer connected to the reset circuitry so that it can reset the board if the timer expires. The WDT driver, and this daemon, periodically "kick" (reset) the timer to prevent it from firing.

Most embedded systems utilise watchdog timers as a way to automatically recover from malfunctions: lock-ups, live-locks, CPU overload. With a bit of logic sprinkled on top the cause can more easily be tracked down.

The Linux kernel provides a common userspace interface /dev/watchdog, created automatically when the appropriate watchdog driver is loaded. If your board does not have a WDT, the kernel provides a softdog.ko module which in many cases can be good enough.
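As a rough illustration of what "kicking" means at this level, the sketch below talks to a watchdog device using the ioctl requests from the kernel's <linux/watchdog.h>. The helper names (wdt_open() and friends) are made up for this example and are not taken from the watchdogd sources. Note that on most drivers simply opening the device arms the timer, so only try this on a system you are prepared to see reboot.

```c
#include <fcntl.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>

/* Open the watchdog device, e.g. "/dev/watchdog".  On most drivers
 * this arms the timer.  Returns a file descriptor, or -1 on error. */
int wdt_open(const char *dev)
{
        return open(dev, O_WRONLY);
}

/* Ask the driver for a new timeout in seconds.  On success the driver
 * writes back the timeout it actually applied, which is returned.
 * Returns -1 with errno set if the request fails. */
int wdt_set_timeout(int fd, int sec)
{
        if (ioctl(fd, WDIOC_SETTIMEOUT, &sec) < 0)
                return -1;

        return sec;
}

/* Query the current timeout in seconds, or -1 with errno set. */
int wdt_get_timeout(int fd)
{
        int sec;

        if (ioctl(fd, WDIOC_GETTIMEOUT, &sec) < 0)
                return -1;

        return sec;
}

/* "Kick" the timer once.  Returns 0 on success, -1 with errno set. */
int wdt_kick(int fd)
{
        int dummy = 0;

        return ioctl(fd, WDIOC_KEEPALIVE, &dummy) < 0 ? -1 : 0;
}
```

A caller would typically open the device once, try wdt_set_timeout(), and then call wdt_kick() from its main loop at an interval comfortably shorter than the timeout.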

The idea of a watchdog daemon in userspace is to run in the background of your system. When there is no more CPU time for the watchdog daemon to run, it will fail to "kick" the WDT. This will in turn cause the WDT to reboot the system. When it does, watchdogd has already saved the reset reason for your post mortem.

As a background process, watchdogd can of course also be used to monitor other aspects of the system ...

What can watchdogd do?

Without arguments watchdogd runs in the background, monitoring the CPU, and as long as there is CPU time it "kicks" the WDT chip (via the driver). If watchdogd is stopped, or does not get enough CPU time to run, the WDT will detect this and reboot the system. This is the normal mode of operation.

With a few lines in /etc/watchdogd.conf, it can also monitor other aspects of the system, such as:

  • Load average
  • Memory leaks
  • File descriptor leaks
  • Process live locks
  • Reset counter, warm boots since last power failure

To top things off there is support for periodically calling a generic script where operators can do housekeeping checks. For details on how to configure this, see the watchdogd.conf(5) man page.

Usage

watchdogd [-hnsVx] [-f FILE] [-l LVL] [-T SEC] [-t SEC] [/dev/watchdog]

Options:
  -f, --config=FILE        Use FILE for daemon configuration
  -n, --foreground         Start in foreground (background is default)
  -s, --syslog             Use syslog, even if running in foreground
  -l, --loglevel=LVL       Log level: none, err, info, notice*, debug
  
  -T, --timeout=SEC        HW watchdog timer (WDT) timeout in SEC seconds
  -t, --interval=SEC       WDT kick interval in SEC seconds, default: 10
  -x, --safe-exit          Disable watchdog on exit from SIGINT/SIGTERM
                           "magic" exit may not be supported by HW/driver
  
  -V, --version            Display version and exit
  -h, --help               Display this help message and exit

Without any arguments, watchdogd opens /dev/watchdog, forks to the background, tries to set a 20 sec WDT timeout, and then kicks every 10 sec. See the Operation section for more information.

Example

watchdogd -T 120 -t 30 /dev/watchdog2
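To make the relationship between the options, their defaults, and the optional device argument concrete, here is a small sketch using getopt_long(3). It only handles -T and -t, and the struct and function names are invented for this example; the real daemon's option parsing may well look different.

```c
#include <getopt.h>
#include <stdlib.h>

struct wd_opts {
        int         timeout;    /* -T, WDT timeout in seconds     */
        int         interval;   /* -t, kick interval in seconds   */
        const char *device;     /* optional positional argument   */
};

/* Parse a subset of the documented options into *o, starting from the
 * documented defaults (20 sec timeout, 10 sec interval, /dev/watchdog).
 * Returns 0 on success, -1 on an unknown option. */
int wd_parse(int argc, char *argv[], struct wd_opts *o)
{
        static const struct option longopts[] = {
                { "timeout",  required_argument, NULL, 'T' },
                { "interval", required_argument, NULL, 't' },
                { NULL, 0, NULL, 0 }
        };
        int c;

        o->timeout  = 20;
        o->interval = 10;
        o->device   = "/dev/watchdog";

        optind = 1;             /* restart scanning on repeated calls */
        while ((c = getopt_long(argc, argv, "T:t:", longopts, NULL)) != -1) {
                switch (c) {
                case 'T':
                        o->timeout = atoi(optarg);
                        break;
                case 't':
                        o->interval = atoi(optarg);
                        break;
                default:
                        return -1;
                }
        }

        /* First non-option argument, if any, names the device node */
        if (optind < argc)
                o->device = argv[optind];

        return 0;
}
```

With the command line from the example above, this yields a 120 sec timeout, a 30 sec kick interval, and /dev/watchdog2 as the device.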

Features

To force a kernel watchdog reboot, watchdogd supports SIGPWR, used by some init(1) systems to delegate a reboot. What it does is to set the WDT timer to the lowest possible value (1 sec), close the connection to /dev/watchdog, and wait for WDT reboot. It waits at most 3x the WDT timeout before announcing HW WDT failure and forcing a reboot.

watchdogd(8) supports optional monitoring of several system resources that can be enabled in the .conf file. First, system load average monitoring can be enabled with:

loadavg {
    enabled  = true
    interval = 300       # Every 5 mins
    warning  = 1.5
    critical = 2.0
}

Second, the memory leak detector, where a value of 1.0 means 100% memory use:

meminfo {
    enabled  = true
    interval = 3600       # Every hour
    warning  = 0.9
    critical = 0.95
}

Third, file descriptor leak detector:

filenr {
    enabled  = true
    interval = 3600       # Every hour
    warning  = 0.8
    critical = 0.95
}

All of these monitors can be very useful on an embedded or headless system with little or no operator.

The two values, warning and critical, are the warning and reboot levels in percent. The critical level is optional; if it is omitted, reboot is disabled. A script can also be run instead of a reboot, see the .conf file for details.
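The following sketch shows one way a usage fraction could be compared against the two levels, using the file descriptor monitor as the example. It assumes the usage is derived from the three fields of the kernel's /proc/sys/fs/file-nr (allocated, unused, maximum). That file is not named in this document, and the function names, the use of >= for the comparison, and the convention that a critical level of 0 means "omitted" are all assumptions made for illustration.

```c
#include <stdio.h>

enum level { LEVEL_OK, LEVEL_WARNING, LEVEL_CRITICAL };

/* Parse a line in the format of /proc/sys/fs/file-nr, i.e.
 * "allocated unused max", and return the fraction of file handles
 * in use (0.0 to 1.0), or -1.0 if the line cannot be parsed. */
double filenr_usage(const char *line)
{
        unsigned long allocated, unused, max;

        if (sscanf(line, "%lu %lu %lu", &allocated, &unused, &max) != 3)
                return -1.0;
        if (max == 0 || unused > allocated)
                return -1.0;

        return (double)(allocated - unused) / (double)max;
}

/* Compare a value against the warning and critical levels.  A critical
 * level <= 0.0 models an omitted setting, i.e. reboot disabled. */
enum level check_level(double value, double warning, double critical)
{
        if (critical > 0.0 && value >= critical)
                return LEVEL_CRITICAL;
        if (value >= warning)
                return LEVEL_WARNING;

        return LEVEL_OK;
}
```

With the filenr settings shown earlier (warning 0.8, critical 0.95), a system using 85% of its file handles would land on the warning level, and one using 97.5% on the critical level.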

Determining suitable system load average levels is tricky. It always depends on the system and use-case, not just the number of CPU cores. Peak loads of 16.00 on an 8 core system may be responsive and still useful, but 2.00 on a 2 core system may be completely bogged down. Make sure to read up on the subject and thoroughly test your system before enabling a reboot trigger value. watchdogd uses an average of the first two load average values, the one (1) and five (5) minute averages. For more information on the UNIX load average, see this StackOverflow question.
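The averaging described above can be sketched as follows, using getloadavg(3) from the C library to sample the system. The function names are hypothetical, and how the daemon itself samples and rounds the values is not specified here.

```c
#define _DEFAULT_SOURCE         /* for getloadavg() in <stdlib.h> */
#include <stdlib.h>

/* Mean of the one and five minute load averages. */
double loadavg_mean(double load1, double load5)
{
        return (load1 + load5) / 2.0;
}

/* Sample the running system.  Returns the mean of the first two load
 * average values, or -1.0 if they could not be obtained. */
double loadavg_sample(void)
{
        double avg[2];

        if (getloadavg(avg, 2) != 2)
                return -1.0;

        return loadavg_mean(avg[0], avg[1]);
}
```

For example, a one minute load of 3.0 and a five minute load of 1.5 give a mean of 2.25, which would exceed the critical level of 2.0 from the loadavg example earlier.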

The RAM usage monitor only triggers on systems without swap. This is detected by reading the file /proc/meminfo, looking for the SwapTotal: value. For more details on the underlying mechanisms of file descriptor usage, see this article. For more info on the details of memory usage, see this article.
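A minimal sketch of the swap detection described above is shown below. It scans text in the /proc/meminfo format for a given key and treats a SwapTotal of zero as "no swap". The helper names are invented for this example, and the parsing is deliberately simple; it is not a copy of what watchdogd does.

```c
#include <stdio.h>
#include <string.h>

/* Look up "key:" in /proc/meminfo style text and return its value in
 * kB, or -1 if the key is missing or has no numeric value. */
long meminfo_kb(const char *text, const char *key)
{
        size_t len = strlen(key);
        const char *line = text;

        while (line && *line) {
                long val;

                /* Match "key:" at the start of the line only */
                if (!strncmp(line, key, len) && line[len] == ':' &&
                    sscanf(line + len + 1, "%ld", &val) == 1)
                        return val;

                /* Advance to the start of the next line, if any */
                line = strchr(line, '\n');
                if (line)
                        line++;
        }

        return -1;
}

/* Non-zero if the text reports a SwapTotal larger than zero. */
int has_swap(const char *text)
{
        return meminfo_kb(text, "SwapTotal") > 0;
}
```

In practice the text would come from reading /proc/meminfo into a buffer; the functions take a string so they can be exercised without access to /proc.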

watchdogd v2.0 and later comes with a process supervisor (previously called pmon). When the supervisor is enabled, and the priority is set to a value > 0, the daemon runs as a real-time task with the configured priority. Monitored clients connect to the supervisor using the libwdog API.

supervisor {
    enabled = true
    priority = 98
}

Note: Linux cgroup v2 does not support realtime tasks in sub-groups.

See below for details on how to have your process's internal deadlines supervised.

When a process fails to meet its deadlines, or a monitor plugin reaches critical level, watchdogd initiates a controlled reset. To see the reset reason after reboot, the following section must be enabled in the /etc/watchdogd.conf file:

reset-reason {
    enabled = true
#   file    = /var/lib/watchdogd.state  # default
}

The file setting is optional and the default is usually sufficient, but make sure the destination directory is writable if you change it. You can either inspect the file, or use the watchdogctl tool.

libwdog API

To have watchdogd supervise a process, it must be instrumented with at least a "subscribe" and a "kick" API call. Commonly this is achieved by adding the wdog_kick() call to the main event loop.

All libwdog API functions, except wdog_ping(), return POSIX OK (0) on success, or a negative value with errno set on error. The wdog_subscribe() call returns a non-negative integer for the watchdog id.

/*
 * Enable or disable watchdogd at runtime.
 */
int wdog_enable      (int enable);
int wdog_status      (int *enabled);

/*
 * Check if watchdogd API is actively responding,
 * returns %TRUE(1) or %FALSE(0)
 */
int wdog_ping        (void);

/*
 * Register/unregister with process supervisor
 */
int wdog_subscribe   (char *label, unsigned int timeout, unsigned int *ack);
int wdog_unsubscribe (int id, unsigned int ack);
int wdog_kick        (int id, unsigned int timeout, unsigned int ack, unsigned int *next_ack);
int wdog_kick2       (int id, unsigned int *ack);
int wdog_extend_kick (int id, unsigned int timeout, unsigned int *ack);

See wdog.h for detailed API documentation.

It is highly recommended to use an event loop like libev, libuev, or similar. For such libraries one can simply add a timer callback for the kick to run periodically to monitor proper operation of the client.

Example

For other applications, identify your main loop, its max period time and instrument it like this:

unsigned int ack;       /* The API takes an unsigned int *ack */
int wid;

/* Library will use process' name on NULL first arg. */
wid = wdog_subscribe(NULL, 10000, &ack);
if (wid < 0)
        ;      /* Error handling */

while (1) {
        ...
        wdog_kick2(wid, &ack);
        ...
}

This example subscribes to the watchdog with a 10 sec timeout. The wid is used in the call to wdog_kick2(), together with the received ack value. The ack value changes every time the application calls wdog_kick2(), so it is important that the correct value is used. Applications should of course check the return value of wdog_subscribe() for errors; that code is left out for readability.

See also the example/ex1.c in the source distribution. This is used by the automatic tests.

Operation

By default, watchdogd forks off a daemon in the background, opens the /dev/watchdog device, attempts to set the default WDT timeout to 20 seconds, and then enters its main loop where it kicks the watchdog every 10 seconds.

If a WDT device driver does not support setting the timeout, watchdogd attempts to query the actual (possibly hard coded) watchdog timeout and then uses half that time as the kick interval.
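The fallback rule above is simple enough to state in a few lines. This sketch derives a kick interval from whatever timeout the driver reports; the function name and the lower bound of one second are assumptions added for the example.

```c
/* Derive the kick interval from the WDT timeout: half the timeout,
 * but never less than 1 second (the clamp is an assumption). */
int kick_interval(int timeout)
{
        int interval = timeout / 2;

        return interval < 1 ? 1 : interval;
}
```

For the default 20 sec timeout this gives the default 10 sec kick interval, and a driver with a hard coded 60 sec timeout would be kicked every 30 sec.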

When watchdogd backgrounds itself, syslog is implicitly used for all informational and debug messages. If a user requests to run the daemon in the foreground watchdogd will also log to STDERR and STDOUT, unless the user gives the --syslog argument to force use of syslog.

The /etc/watchdogd.conf file and the command line control tool watchdogctl can be used to enable more features and query status.

Debugging

The code base has LOG(), INFO() and DEBUG() statements almost everywhere. Use the --loglevel=debug command line option to enable full debug output to stderr or the syslog, depending on how you start watchdogd. The default log level is notice, which enables LOG(), WARN() and error messages.

The watchdogctl debug command can be used at runtime to enable the debug log level, without having to restart a running daemon.

Build & Install

Note: To enable any of the extra monitors and the process supervisor, see ./configure --help

watchdogd is tailored for Linux systems and builds against most modern C libraries. However, three external libraries are required: libite, libuEv, and libConfuse. None of them should present any surprises; all of them use de facto standard configure scripts and support pkg-config. The latter is used by the watchdogd configure script to locate required libraries and header files.

The common ./configure --some --args --here && make is usually sufficient to build watchdogd. But, if libraries are installed in non-standard locations you may need to provide their paths, e.g. with PKG_CONFIG_PATH. The following also sets the most common install and search paths for the build:

PKG_CONFIG_PATH=/opt/lib/pkgconfig:/home/ian/lib/pkgconfig \
    ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
make

If you're not building from a released tarball but instead use the GIT sources, see the Contributing section below.

Origin & References

watchdogd(8) is an improved version of the original, created by Michele d'Amico and adapted to uClinux-dist by Mike Frysinger. It is maintained by Joachim Wiberg collaboratively at GitHub.

The original code in uClinux-dist is available in the public domain, whereas this version is distributed under the ISC license. See the file LICENSE for more details on this.

The logo, "Watch Dog Detective Taking Notes", is licensed for use by the watchdogd project, copyright © Ron Leishman.

Contributing

If you find bugs or want to contribute fixes or features, check out the code from GitHub:

git clone https://github.com/troglobit/watchdogd
cd watchdogd
./autogen.sh

The autogen.sh script runs autoconf, automake, et al to create the configure script and such generated files not part of the VCS tree. For more details, see the file CONTRIBUTING in the GIT sources.
