Performance comparison of parallel ray tracing in functional programming languages

This repository is an embryonic collection of ray tracers written with parallel functional programming techniques. The intent is to investigate, on a rather small and simple problem, to which degree functional programming lives up to the frequent promise of easy parallelism, and whether the resulting code is actually fast in an objective sense. The benchmarking technique is mostly crude, so assume only large relative differences are meaningful. I welcome contributions, as I have little confidence that any of my code is optimal. I am an expert in at most one of the languages on exhibition here. I also welcome new implementations in other languages!

Note also that this is not a good ray tracer. It does not generate particularly pretty images. It's chosen simply because it expresses two interesting kinds of parallelism (see below), and because even an ugly image is more interesting than just a number. Two scenes are used. The first is rgbbox:

The second is irreg:

This second scene is interesting because the load is unbalanced: all objects are in the lower half of the pixels.

For each scene, two things are benchmarked:

Constructing a BVH of the scene. This is interesting because it is a divide-and-conquer task parallel problem.
Actually rendering the scene, accelerated by the BVH. This is mostly straightforward data parallelism, but with a potentially beefy amount of work for each pixel.

Results

The following measurements are for 1000x1000 renderings. I used a Ryzen 1700X (8 cores, 16 threads) CPU and an MI100 GPU. Compare numbers within the same column.

Language	rgbbox (BVH)	rgbbox (render)	irreg (BVH)	irreg (render)
F#	0.5ms	816ms	6.1ms	437ms
Futhark (GPU)	1.1ms	14ms	1.4ms	8ms
Futhark (CPU)	0.2ms	179ms	2.8ms	62ms
Haskell	0.3ms	590ms	12.2ms	344ms
MPL	0.4ms	341ms	9.4ms	112ms
OCaml	1.3ms	723ms	15ms	240ms
Rust	0.1ms	258ms	0.8ms	100ms
Scala	0.2ms	306ms	4.2ms	126ms

Commentary

The Haskell implementation uses the Strict language pragma to disable laziness in the core modules. This has about 1.5-2x impact on the run-time. The massiv library is used for parallel arrays and is the source of most of the performance.

After a few false starts, F# runs quite fast when using .NET Core. The main tricks appear to be using inline functions and explicit value types.

MPL (which is a parallelism-oriented fork of MLton for Standard ML) is definitely the star here. The code is readable, written in a completely natural style, and performance is excellent.

Multicore OCaml is also quite fast, and the code is likewise very clean.

While the implementations are allowed to use single-precision floating point if they wish, the Scala implementation is actually much faster when using double precision.

While Futhark is fast, the code is significantly longer and more complex. This is particularly because of the BVH construction. In all other implementations, the BVH is expressed as a straightforward recursive divide-and-conquer function, which is also easy to parallelise with fork-join techniques. Since Futhark does not support recursion, it instead uses a bottom-up technique presented by Tero Karras in the paper Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees. This is actually a pretty fast technique (although not for the small scenes used here), but it is about two hundred lines longer than the recursive formulation. The CPU timings use the multicore backend and clang for compiling the C code.

While Rust is not a functional language, it is included as an example of the performance of (relatively) low level programming. Unsurprisingly, it is among the fastest CPU languages, as it has a mature compiler, and its default behaviour of unboxing everything is exactly what you need for this program.

What is not visible from the above table is that most of the implementations were significantly slower in their original formulation. Only Futhark, MPL, and Rust are essentially unchanged from their first straightforward implementation. For the others, most of the performance comes down to various low-level tweaks, in particular avoiding boxing and allocations. This is not exactly unexpected, but I still find it sad that when it comes to performance in functional languages, we must think about the compiler more than we think about the language.

athas/raytracers

athas

Reviews

Repository Details

Performance comparison of parallel ray tracing in functional programming languages

Results

Commentary

See also

More Repositories