erlperf

Erlang Performance & Benchmarking Suite. Simple way to say "this code is faster than that one". See CLI reference and detailed API reference for erlperf and erlperf_job modules.

Build (tested with OTP 23, 24, 25):

    $ rebar3 as prod escriptize

Quick start: command line

Beware of the shell escaping your code in an unpredictable way!

Run a single process iterating rand:uniform() in a tight loop for 3 seconds, printing average iterations per second (~17 millions) and an average time to run a single iteration (57 ns).

    $ ./erlperf 'rand:uniform().'
    Code                    ||        QPS       Time
    rand:uniform().          1   17266 Ki      57 ns

Run four processes doing this same concurrently.

    $ ./erlperf 'rand:uniform().' -c 4
    Code                    ||        QPS       Time
    rand:uniform().          4   53893 Ki      74 ns

Benchmark rand:uniform() vs crypto:strong_rand_bytes/1 for 10 seconds, adding an extra second to warm up the algorithms.

    $ ./erlperf 'rand:uniform().' 'crypto:strong_rand_bytes(2).' --samples 10 --warmup 1
    Code                             ||   Samples       Avg   StdDev    Median      P99  Iteration    Rel
    rand:uniform().                   1        10  16611 Ki    0.20%  16614 Ki 16664 Ki      60 ns   100%
    crypto:strong_rand_bytes(2).      1        10   1804 Ki    0.79%   1797 Ki  1829 Ki     554 ns    11%

Run a function passing the state into the next iteration. This code demonstrates performance difference between rand:uniform_s/1 with state passed explicitly, and rand:uniform/1 reading state from the process dictionary.

    $ ./erlperf 'r(_, S) -> {_, N} = rand:uniform_s(S), N.' --init_runner 'rand:seed(exsss).' \
                'r() -> rand:uniform().'
    Code                                              ||        QPS       Time   Rel
    r(_, S) -> {_, N} = rand:uniform_s(S), N.          1   26180 Ki      38 ns  100%
    r() -> rand:uniform().                             1   16958 Ki      58 ns   65%

Estimate ./erlperf 'application_controller:is_running(kernel). concurrency characteristics. This function is implemented as gen_server:call, and all calculations are done in a single process. It is still possible to squeeze a bit more from a single process by putting work into the queue from multiple runners.

    $ ./erlperf 'application_controller:is_running(kernel).' --squeeze
    Code                                               ||        QPS       Time
    application_controller:is_running(kernel).          3    1189 Ki    2524 ns



    $ ./erlperf 'persistent_term:put(atom, "string").' -q
    Code                                         ||        QPS       Time
    persistent_term:put(atom, "string").          1    8882 Ki     112 ns

Start a server (pg scope in this example), use it in benchmark, and shut down after.

    $ ./erlperf 'pg:join(scope, group, self()), pg:leave(scope, group, self()).' \
                --init 'pg:start_link(scope).' --done 'gen_server:stop(scope).'
    Code                                                                   ||        QPS       Time
    pg:join(scope, group, self()), pg:leave(scope, group, self()).          1     336 Ki    2976 ns

Run the same code with different arguments, returned from init_runner function. Note the trick of adding extra spaces in the source code to know which code is where.

    $ ./erlperf 'runner(X) -> timer:sleep(X).' --init_runner '1.' \
                '  runner(X) -> timer:sleep(X).' --init_runner '2.'
    Code                                   ||        QPS       Time   Rel
    runner(X) -> timer:sleep(X).            1        500    2001 us  100%
      runner(X) -> timer:sleep(X).          1        333    3001 us   67%

Determine how many times a process can join/leave pg2 group on a single node (requires OTP 23 or older, as pg2 is removed in later versions).

    $ ./erlperf 'ok = pg2:join(g, self()), ok = pg2:leave(g, self()).' --init 'pg2:create(g).'
    Code                                                         ||        QPS       Time
    ok = pg2:join(g, self()), ok = pg2:leave(g, self()).          1      64021   15619 ns

Compare pg with pg2 running in a 3-node cluster. Note the -i argument spawning an isolated extra Erlang VM for each benchmark.

    ./erlperf 'ok = pg2:join(g, self()), ok = pg2:leave(g, self()).' --init 'pg2:create(g).' \
              'ok = pg:join(g, self()), ok = pg:leave(g, self()).' --init 'pg:start(pg).' -i
    Code                                                         ||        QPS       Time     Rel
    ok = pg:join(g, self()), ok = pg:leave(g, self()).            1     241 Ki    4147 ns    100%
    ok = pg2:join(g, self()), ok = pg2:leave(g, self()).          1       1415     707 us      0%

Watch the progress of your test running (-v option) with extra information: scheduler utilisation, dirty CPU & IO schedulers, number of running processes, ports, ETS tables, and memory consumption. Last column is the job throughput. When there are multiple jobs, multiple columns are printed. Test will continue until adding 8 more workers (-t 8) does not increase total throughput.

    $ ./erlperf 'rand:uniform().' -q -v -t 8

    YYYY-MM-DDTHH:MM:SS-oo:oo  Sched   DCPU    DIO    Procs    Ports     ETS Mem Total  Mem Proc   Mem Bin   Mem ETS     <0.84.0>
    2023-01-22T11:02:51-08:00   6.12   0.00   0.20       46        2      21  24737 Kb   4703 Kb    191 Kb    471 Kb     14798 Ki
    2023-01-22T11:02:52-08:00   6.31   0.00   0.00       46        2      21  25105 Kb   5565 Kb    218 Kb    472 Kb     16720 Ki
    2023-01-22T11:02:53-08:00   6.26   0.00   0.00       46        2      21  25501 Kb   5427 Kb    218 Kb    472 Kb     16715 Ki
    <...>
    2023-01-22T11:03:37-08:00 100.00   0.00   0.00       61        2      21  25874 Kb   5696 Kb    221 Kb    472 Kb     55235 Ki
    2023-01-22T11:03:38-08:00 100.00   0.00   0.00       61        2      21  25955 Kb   5565 Kb    218 Kb    472 Kb     55139 Ki
    Code                    ||        QPS       Time
    rand:uniform().          8   61547 Ki     130 ns

Benchmark

Running benchmark is called a job, see erlperf_job for detailed description. Every job has a controller process, responsible for starting and stopping worker processes, or workers. Worker processes execute runner function in a tight loop, incrementing iteration counter.

Benchmark runs either for a specified amount of time (sample duration in continuous mode), or until requested number of iterations is made (timed mode). Resulting sample is total number of iterations for all workers, or elapsed time it took in timed mode.

The process repeats until the specified amount of samples is collected, producing a report (see details below).

For comparison convenience, basic reports contain QPS - historical metric from the original implementation (designed for network service throughput assessment). It is approximate amount of runner iterations per sample_duration achieved by all workers of the job. Given that default duration is 1 second, QPS is a good proxy for the total job throughput.

Single worker performance can be estimated using time metric. It can also be considered as function latency - how long it takes on average to execute a single iteration of a runner.

Benchmark definition

A benchmark may define following functions:

runner: code that is executed in the tight loop
init (optional): executed once when the job starts
done (optional): executed once when the job is about to stop
init_runner (optional): executed on every worker process startup

See erlperf_job for the detailed reference and ways to define a function (callable).

Note that different ways to call a function have different performance characteristics:

    $ ./erlperf '{rand, uniform, []}' 'rand:uniform().' -l 10M
    Code                      ||        QPS       Time   Rel
    rand:uniform().            1   18519 Ki      54 ns  100%
    {rand,uniform,[]}          1   16667 Ki      60 ns   90%

This difference may get more pronounced depending on ERTS version and runner code:

    (erlperf@max-au)7> erlperf:benchmark([
            #{runner => "runner(X) -> is_float(X).", init_runner=>"2."},
            #{runner => {erlang, is_float, [2]}},
            #{runner => fun (X) -> is_float(X) end, init_runner => "2."}],
        #{}, undefined).
    [105824351,66424280,5057372]

It is caused by the ERTS: running compiled code (first variant) with OTP 25 is two times faster than applying a function, and 20 times faster than repeatedly calling anonymous fun. Use the same invocation method to get a relevant result.

Absolute benchmarking overhead may be significant for very fast functions taking just a few nanoseconds. Use timed mode for such occasions.

Run options

See erlperf module documentation and command line reference for all available options.

Benchmarking modes

Continuous mode

Benchmarking is done by counting number of runner iterations done over a specified period of time (sample_duration).

Two examples below demonstrate the effect caused by changing sample_duration. First run takes 20 samples (-s 20) with 100 ms duration. Second invocation takes the same 20 sample, but with 200 ms duration (-d 200). Note that all metrics, except a single iteration time, doubled.

    $ ./erlperf 'rand:uniform().' -d 100 -s 20
    Code                ||   Samples       Avg   StdDev    Median      P99  Iteration
    rand:uniform().      1        20   1647 Ki    0.39%   1648 Ki  1660 Ki      60 ns
    $ ./erlperf 'rand:uniform().' -d 200 -s 20
    Code                ||   Samples       Avg   StdDev    Median      P99  Iteration
    rand:uniform().      1        20   3354 Ki    0.16%   3354 Ki  3368 Ki      59 ns

Timed mode

In this mode runner code is executed for sample_duration iterations for every sample. Report contains average/median/p99 time it takes to produce a single sample. In the example below, it takes an average of 554 ms to make 10 million calls to rand:uniform().

    $ ./erlperf 'rand:uniform().' 'rand:uniform(1000).' -l 10M -s 20
    Code                    ||   Samples       Avg   StdDev    Median      P99  Iteration    Rel
    rand:uniform(1000).      1        20    554 ms    0.37%    554 ms   563 ms      55 ns   100%
    rand:uniform().          1        20    560 ms    0.60%    560 ms   564 ms      55 ns    99%

Effectively, this example runs following code: loop(0) -> ok; loop(Count) -> rand:uniform(), loop(Count - 1). Timed mode has slightly less overhead compared to continuous mode.

Timed mode does not support --concurrency setting, using only one process. However, it does support comparison run with multiple concurrent jobs.

Concurrency estimation mode

In this mode erlperf performs multiple continuous benchmarks with increasing concurrency. The test concludes when increasing worker count does not result in increase of the total throughput. Report contains statistics of the most successful run.

This mode can also be used to detect bottlenecks, e.g. lock contention, single gen_server processes, or VM-wide shared resources (persistent_terms). Example (with maximum concurrency limited to 50):

    $ ./erlperf '{code, is_loaded, [local_udp]}' -w 1 --max 50 -q
    Code                                 ||        QPS       Time
    {code,is_loaded,[local_udp]}          6    1665 Ki    3604 ns

Same in the Erlang shell:

    > erlperf:run({code, is_loaded, [local_udp]}, #{warmup => 1}, #{max => 50}).
   {1676758,6}

In this example, 6 concurrent processes were able to squeeze 1676758 calls per second for code:is_loaded(local_udp). In current OTP version code:is_loaded is implemented as a gen_server:call to a single process (code_server), that limits potential performance.

See erlperf_job for the detailed description of different benchmarking modes.

Reports

Historically erlperf had only the basic reporting available for command line usage. Since 2.2 it is possible to request additional information.

Basic report

This is the default report form when less than 10 samples were collected. Use -r basic to force basic reports with 10 and more samples.

Basic report contains following columns:

Code: Erlang code supplied to the benchmark
||: how many concurrent processes were running. In the timed mode, it is always 1. In the concurrency estimation mode, the number that achieved the highest total throughput (QPS)
QPS: average number of runner code iterations (throughput). Measure per single sample_duration in the continuous mode. In the timed mode, calculated with the assumption that sample_duration is 1 second
Time: single runner iteration time
Rel: relative performance of this code, compared to others. Printed only when more than one runner is specified.

Extended report

When 10 or more samples were collected, this mode is the default. Use -r extended to force printing this report for smaller sample sets.

Note that average, deviation, median and 99th percentile are calculated for the sample_duration. If you requested 20 samples of 100 ms in the continuous mode, these fields will contain iteration count per 100 ms. If you requested 10 million iterations (-l 10M), extended report for timed mode displays average time it takes to do 10M iterations. Single iteration time is printed as Iteration.

Code, concurrency, and relative performance fields have the same meaning as in basic report. In addition, following columns are printed:

Samples: how many samples were collected (useful when requesting continuous test with standard deviation requirement)
Avg: same as QPS for continuous mode, but in the timed mode, average sample time
StdDev: standard deviation from average
Median: median value, in the continuous mode, median estimated throughput, in the timed mode - time to complete the requested iterations
Iteration: single runner iteration time
P99: 99th percentile

Full report

This mode must be explicitly specified with -r full.

Contains everything that extended report has. Includes extra information about the system used for benchmarking - OS type, CPU and Erlang VM characteristics.

Benchmarking compiled code

erlperf can be used to measure performance of your application running in production, or code that is stored on disk.

Use -pa argument to add an extra code path. Example:

    $ ./erlperf 'args:parse([], #{}).' -pa _build/test/lib/argparse/ebin
    Code                             ||        QPS       Time
    args:parse([], #{}).              1     955 Ki    1047 ns

If you need to add multiple released applications, supply ERL_LIBS environment variable instead:

    $ ERL_LIBS="_build/test/lib" erlperf 'args:parse([], #{}).'
    Code                             ||        QPS       Time
    args:parse([], #{}).              1     735 Ki    1361 ns

Usage in production

It is possible to use erlperf to benchmark an application running in production. Add erlperf as a dependency, and use remote shell:

    # connect a remote shell to the production node
    erl -remsh production@max-au
    (production@max-au)3> erlperf:run(timer, sleep, [1]).
    488

Permanent continuous benchmarking

You can run a job continuously, to examine performance gains or losses while doing hot code reload. This process is designed to help during development and testing stages, allowing to quickly notice performance regressions.

Example source code:

    -module(mymod).
    -export([do/1]).
    do(Arg) -> timer:sleep(Arg).

Example below assumes you have erlperf application started (e.g. in a rebar3 shell)

    % start a logger that prints VM monitoring information
    > {ok, Logger} = erlperf_file_log:start_link().
    {ok,<0.235.0>}

    % start a job that will continuously benchmark mymod:do(),
    %  with initial concurrency 2.
    > JobPid = erlperf:start(#{init_runner => "rand:uniform(10).",
        runner => "runner(Arg) -> mymod:do(Arg)."}, 2).
    {ok,<0.291.0>}

    % increase concurrency to 4
    > erlperf_job:set_concurrency(JobPid, 4).
    ok.

    % watch your job performance

    % modify your application code,
    % set do(Arg) -> timer:sleep(2*Arg), do hot code reload
    > c(mymod).
    {module, mymod}.

    % see that after hot code reload throughput halved!

Timer precision

ERTS cannot guarantee precise timing when there is severe lock contention happening, and scheduler utilisation is 100%. This often happens with ETS:

    $ ./erlperf -c 50 'ets:insert(ac_tab, {1, 2}).' -d 100 -s 50
    Timer accuracy problem detected, results may be inaccurate

    Code                            ||   Samples       Avg   StdDev    Median      P99  Iteration
    ets:insert(ac_tab, {1, 2}).     50        50      6079   82.27%      5497    40313     823 us

Running 50 concurrent processes trying to overwrite the very same key of an ETS table leads to lock contention on a shared resource (ETS table/bucket lock). erlperf may detect this issue and switch to a busy wait loop for precise timing. This may result in lowered throughput and other metrics skew. erlperf does not attempt to pinpoint the source of contention, it is up to user to figure that out. It's recommended to use lock-counting emulator, or Linux perf utility to troubleshoot VM-level issues.

Experimental features

These features are not fully supported. APIs may change in the future erlperf releases.

Benchmarking in a cluster

It's possible to run a job on a separate node in the cluster. See erlperf_cluster_monitor for additional details.

    % watch the entire cluster (printed to console)
    (node1@host)> {ok, _} = erlperf_history:start_link().
    {ok,<0.213.0>}
    (node1@host)> {ok, ClusterLogger} = erlperf_cluster_monitor:start_link(group_leader(), 1000, [node, sched_util, jobs]).
    {ok, <0.216.0>}

    % also log cluster-wide reports to file (jobs & sched_util)
    (node1@host)> {ok, FileLogger} = erlperf_cluster_monitor:start_link("/tmp/cluster", 1000, [time, node, sched_util, jobs]).
    {ok, <0.223.0>}

    % run the benchmarking process in a different node of your cluster
    (node1@host)> rpc:call('node2@host', erlperf, run, [#{runner => {rand, uniform, []}}]).

max-au/erlperf

max-au

Reviews

Repository Details