
Programming language C++: Benchmark of the parallel STL algorithms


Today I’m happy to present a guest post by Victor J. Duvanenko about my favorite feature of the STL: the parallel STL algorithms. Victor has published a book on the subject, “Practical Parallel Algorithms in C++ and C#”. This brings me straight to the translation of Victor’s English guest post:


Rainer Grimm has worked for many years as a software architect, team lead, and training manager. He enjoys writing articles on the programming languages C++, Python, and Haskell, and frequently speaks at specialist conferences. On his blog Modernes C++ he explores his passion for C++ in depth.

C++ includes a standard set of generic algorithms, the STL (Standard Template Library). On Windows, Microsoft offers parallel versions of these algorithms, listed below with the std:: prefix. Intel also offers parallel implementations, listed below with the dpl:: prefix. These parallel versions use multiple processor cores and offer much higher performance for many of the standard algorithms.

Specifying a single-core or multi-core version of algorithms is easy, as shown below:

#include <algorithm>   // std::sort
#include <execution>   // std::execution policies

std::sort(data.begin(), data.end()); // single-core
// or
std::sort(std::execution::seq, data.begin(), data.end()); // single-core
// Parallel multi-core
std::sort(std::execution::par, data.begin(), data.end()); // multi-core

The first two calls to sort() are equivalent and run on a single core. The last call to sort() uses multiple cores. There are four execution policies for running an algorithm:

seq – single-core sequential version
unseq – single-core SIMD/SSE vectorized version
par – multi-core version
par_unseq – multi-core SIMD/SSE vectorized version
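A minimal sketch of the four policies, and of the corresponding Intel call, is shown below; it assumes the oneDPL headers from Intel’s oneAPI toolkit and a namespace alias dpl for oneapi::dpl:

// Sketch: the four execution policies and the Intel dpl:: counterpart.
#include <oneapi/dpl/execution>   // assumed: Intel oneDPL (oneAPI toolkit)
#include <oneapi/dpl/algorithm>
#include <algorithm>
#include <execution>
#include <vector>
namespace dpl = oneapi::dpl;

int main() {
    std::vector<int> data(100'000'000);                              // 100 million integers

    std::sort(std::execution::seq,       data.begin(), data.end()); // single-core, sequential
    std::sort(std::execution::unseq,     data.begin(), data.end()); // single-core, vectorized (C++20)
    std::sort(std::execution::par,       data.begin(), data.end()); // multi-core
    std::sort(std::execution::par_unseq, data.begin(), data.end()); // multi-core, vectorized

    // Intel's implementation is called the same way; only the namespaces change:
    dpl::sort(dpl::execution::par, data.begin(), data.end());
}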

The following table shows the algorithms that exhibit the highest degree of parallel acceleration when running on a 14-core i7-12700 laptop processor and processing an array of 100 million 32-bit integers:



n/s means not supported, i.e. the variant is not implemented.

Benchmarks on a 48-core processor will be shown later. The units for each item in the table are millions of integers per second. For example, the par version of std::sort runs at 73 million integers/second and the par version of dpl::sort at 109 million integers/second. Each column shows the performance of one implementation. The array size of 100 million was chosen to rule out caching effects, since the array is too large to fit in the processor’s cache; this measures the parallel improvement for the very-large-array use case.
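These throughput figures can be reproduced with a simple timing harness. The following sketch is illustrative (random fill, a single timed run) rather than the exact code from the benchmark repository:

// Sketch: measure sort throughput in millions of integers per second.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <execution>
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::vector<std::int32_t> data(100'000'000);         // 100 million 32-bit integers
    std::mt19937 gen(42);
    std::uniform_int_distribution<std::int32_t> dist;
    std::generate(data.begin(), data.end(), [&] { return dist(gen); });

    auto start = std::chrono::steady_clock::now();
    std::sort(std::execution::par, data.begin(), data.end());
    auto stop = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(stop - start).count();
    std::cout << (data.size() / seconds) / 1e6 << " million integers/second\n";
}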

The benchmarks above were run with Visual Studio 2022 in Release mode, using the Intel C++ compiler and the Intel oneAPI 2023.1 environment. This gives access to both the Microsoft std:: implementations and the Intel dpl:: implementations. The benchmarks were carried out under Windows 11.

Not every variant of the standard algorithms is implemented by Microsoft or Intel.

Microsoft does not support the sequential SIMD (unseq) variant for many algorithms; those calls do not even compile. For other algorithms, the Microsoft implementations compile but offer no performance benefit.

Intel implements more of the parallel algorithms, and with higher performance for most of them, the exception being count (on 48 cores). The parallel versions scale as the number of cores increases: all algorithms speed up when run on multiple cores.

The following table contains benchmarks for single-core and multi-core implementations of standard C++ algorithms on the 14-core laptop processor:

Some of the standard C++ algorithms scale better than others as the processor core count increases. The following table shows the performance of single-core serial and multi-core parallel algorithms on a 48-core Intel Xeon 8275CL AWS node when processing an array of 100 million 32-bit integers:

Each parallel algorithm is accelerated more when more cores and higher memory bandwidth are available on the 48-core processor. For large problems, parallel algorithm implementations offer a significant performance advantage.


Nvidia has developed a C++ parallel STL implementation for its nvc++ compiler that runs on x86 processors such as those from Intel and AMD. The table below will be updated shortly with results for the benchmark compiled with -stdpar=multicore:

Nvidia’s parallel implementations show significant performance differences compared to those of Intel and Microsoft. Nvidia also supports accelerating standard C++ algorithms with GPUs. See benchmarks in this blog.
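Since the parallel algorithms are plain standard C++, the same source can be retargeted simply by changing compiler flags. A minimal sketch; the file name is hypothetical and the compile commands in the comment use Nvidia’s documented -stdpar options:

// stdpar_sort.cpp -- plain standard C++, no vendor-specific code.
// Assumed compile commands (nvc++ from the NVIDIA HPC SDK):
//   nvc++ -O3 -stdpar=multicore stdpar_sort.cpp -o stdpar_sort   // run on CPU cores
//   nvc++ -O3 -stdpar=gpu       stdpar_sort.cpp -o stdpar_sort   // offload to the GPU
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<int> data(100'000'000, 1);
    std::sort(std::execution::par, data.begin(), data.end());
}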

The source code for the benchmark implementation is freely available in this repository:

If you want to develop your own parallel algorithms, the following book explains how with worked examples. For example, the standard algorithms only provide a comparison-based sort; the parallel linear-time radix sort developed in the book boosts performance many times over. Various strategies for simplifying parallel programming are demonstrated, and all source code is freely available in a repository:
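As a taste of the radix-sort idea mentioned above, here is a minimal sequential LSD radix sort for 32-bit unsigned integers; it conveys the linear-time principle but is not the parallel implementation developed in the book:

// Sketch: least-significant-digit radix sort, 8 bits per pass (4 passes for 32-bit keys).
#include <array>
#include <cstdint>
#include <vector>

void radix_sort(std::vector<std::uint32_t>& data) {
    std::vector<std::uint32_t> buffer(data.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::array<std::size_t, 257> count{};            // digit counts, then prefix sums
        for (std::uint32_t v : data)
            ++count[((v >> shift) & 0xFF) + 1];
        for (std::size_t i = 1; i < count.size(); ++i)   // exclusive prefix sum
            count[i] += count[i - 1];
        for (std::uint32_t v : data)                     // stable scatter by current digit
            buffer[count[(v >> shift) & 0xFF]++] = v;
        data.swap(buffer);                               // even number of passes: result ends in data
    }
}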

In my next article I will continue my journey through C++23. This time I will write about std::expected. (rm)
