
Programming language C++: Benchmark of the parallel STL algorithms


Today I’m happy to present a guest post by Victor J. Duvanenko about my favorite feature of the STL: the parallel STL algorithms. Victor has published a book on the subject, “Practical Parallel Algorithms in C++ and C#”. This brings me straight to the translation of Victor’s English guest post:


Rainer Grimm has worked for many years as a software architect, team lead, and training manager. He enjoys writing articles on the programming languages C++, Python, and Haskell, and frequently speaks at specialist conferences. On his blog Modernes C++ he explores his passion for C++ in depth.

C++ includes a standard set of generic algorithms, the STL (Standard Template Library). On Windows, Microsoft offers parallel versions of these algorithms, listed below with the std:: prefix. Intel also offers parallel implementations, listed below with the dpl:: prefix. These parallel versions use multiple processor cores and offer much higher performance for many of the standard algorithms.

Specifying a single-core or multi-core version of algorithms is easy, as shown below:

#include <algorithm>   // std::sort
#include <execution>   // std::execution policies

std::sort(data.begin(), data.end()); // single-core
// or
std::sort(std::execution::seq, data.begin(), data.end()); // single-core
// Parallel multi-core
std::sort(std::execution::par, data.begin(), data.end()); // multi-core

The first two calls to sort() are equivalent and run on a single core. The last call to sort() uses multiple cores. There are four execution policies for running an algorithm:

seq – single-core sequential version
unseq – single-core SIMD/SSE vectorized version
par – multi-core version
par_unseq – multi-core SIMD/SSE vectorized version
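A minimal sketch of the four policies, and of the corresponding Intel call, is shown below; it assumes the oneDPL headers from Intel’s oneAPI toolkit and a namespace alias dpl for oneapi::dpl:

// Sketch: the four execution policies and the Intel dpl:: counterpart.
#include <oneapi/dpl/execution>   // assumed: Intel oneDPL (oneAPI toolkit)
#include <oneapi/dpl/algorithm>
#include <algorithm>
#include <execution>
#include <vector>
namespace dpl = oneapi::dpl;

int main() {
    std::vector<int> data(100'000'000);                              // 100 million integers

    std::sort(std::execution::seq,       data.begin(), data.end()); // single-core, sequential
    std::sort(std::execution::unseq,     data.begin(), data.end()); // single-core, vectorized (C++20)
    std::sort(std::execution::par,       data.begin(), data.end()); // multi-core
    std::sort(std::execution::par_unseq, data.begin(), data.end()); // multi-core, vectorized

    // Intel's implementation is called the same way; only the namespaces change:
    dpl::sort(dpl::execution::par, data.begin(), data.end());
}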

The following table shows the algorithms that exhibit the highest degree of parallel acceleration when running on a 14-core i7-12700 laptop processor and processing an array of 100 million 32-bit integers:



n/s means not supported, i.e. the variant is not implemented.

Benchmarks on a 48-core processor will be shown later. The units for each item in the table are millions of integers per second. For example, the par version of std::sort runs at 73 million integers/second and the par version of dpl::sort at 109 million integers/second. Each column shows the performance of one implementation. The array size of 100 million was chosen to rule out caching effects, since the array is too large to fit in the processor’s cache; this measures the parallel improvement for the very-large-array use case.
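These throughput figures can be reproduced with a simple timing harness. The following sketch is illustrative (random fill, a single timed run) rather than the exact code from the benchmark repository:

// Sketch: measure sort throughput in millions of integers per second.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <execution>
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::vector<std::int32_t> data(100'000'000);         // 100 million 32-bit integers
    std::mt19937 gen(42);
    std::uniform_int_distribution<std::int32_t> dist;
    std::generate(data.begin(), data.end(), [&] { return dist(gen); });

    auto start = std::chrono::steady_clock::now();
    std::sort(std::execution::par, data.begin(), data.end());
    auto stop = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(stop - start).count();
    std::cout << (data.size() / seconds) / 1e6 << " million integers/second\n";
}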

The benchmarks above were run with Visual Studio 2022 in Release mode, using the Intel C++ compiler and the Intel oneAPI 2023.1 environment. This gives access to both the Microsoft std:: implementations and the Intel dpl:: implementations. The benchmarks were carried out under Windows 11.

Not every variant of the standard algorithms is implemented by Microsoft or Intel.

Microsoft does not support the sequential SIMD (unseq) variant for many algorithms; those calls do not even compile. For other algorithms, the Microsoft implementations compile but offer no performance benefit.

Intel implements more of the parallel algorithms, and with higher performance for most of them, the exception being count (on 48 cores). The parallel versions scale as the number of cores increases: all algorithms speed up when run on multiple cores.

The following table contains benchmarks for single-core and multi-core implementations of standard C++ algorithms on the 14-core laptop processor:

Some of the standard C++ algorithms scale better than others as the processor core count increases. The following table shows the performance of single-core serial and multi-core parallel algorithms on a 48-core Intel Xeon 8275CL AWS node when processing an array of 100 million 32-bit integers:

Each parallel algorithm is accelerated more when more cores and higher memory bandwidth are available on the 48-core processor. For large problems, parallel algorithm implementations offer a significant performance advantage.


Nvidia has developed a C++ parallel STL implementation for its nvc++ compiler that runs on x86 processors such as those from Intel and AMD. The table below will be updated shortly with results for the benchmark compiled with -stdpar=multicore:

Nvidia’s parallel implementations show significant performance differences compared to those of Intel and Microsoft. Nvidia also supports accelerating standard C++ algorithms with GPUs. See benchmarks in this blog.
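Since the parallel algorithms are plain standard C++, the same source can be retargeted simply by changing compiler flags. A minimal sketch; the file name is hypothetical and the compile commands in the comment use Nvidia’s documented -stdpar options:

// stdpar_sort.cpp -- plain standard C++, no vendor-specific code.
// Assumed compile commands (nvc++ from the NVIDIA HPC SDK):
//   nvc++ -O3 -stdpar=multicore stdpar_sort.cpp -o stdpar_sort   // run on CPU cores
//   nvc++ -O3 -stdpar=gpu       stdpar_sort.cpp -o stdpar_sort   // offload to the GPU
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<int> data(100'000'000, 1);
    std::sort(std::execution::par, data.begin(), data.end());
}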

The source code for the benchmark implementation is freely available in this repository:

If you want to develop your own parallel algorithms, the following book explains how with worked examples. For example, the standard algorithms only provide a comparison-based sort; the parallel linear-time radix sort developed in the book boosts performance many times over. Various strategies for simplifying parallel programming are demonstrated, and all source code is freely available in a repository:
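As a taste of the radix-sort idea mentioned above, here is a minimal sequential LSD radix sort for 32-bit unsigned integers; it conveys the linear-time principle but is not the parallel implementation developed in the book:

// Sketch: least-significant-digit radix sort, 8 bits per pass (4 passes for 32-bit keys).
#include <array>
#include <cstdint>
#include <vector>

void radix_sort(std::vector<std::uint32_t>& data) {
    std::vector<std::uint32_t> buffer(data.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::array<std::size_t, 257> count{};            // digit counts, then prefix sums
        for (std::uint32_t v : data)
            ++count[((v >> shift) & 0xFF) + 1];
        for (std::size_t i = 1; i < count.size(); ++i)   // exclusive prefix sum
            count[i] += count[i - 1];
        for (std::uint32_t v : data)                     // stable scatter by current digit
            buffer[count[(v >> shift) & 0xFF]++] = v;
        data.swap(buffer);                               // even number of passes: result ends in data
    }
}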

In my next article I will continue my journey through C++23. This time I will write about std::expected. (rm)
