Possibilities for increasing the performance of multithreaded programs

Question:

Good afternoon everyone!

The next question is ripe for those who are engaged in multithreaded programs: how do you improve the performance indicators of the program, if it has not met the expectations assigned to it in this regard?

I will describe the problem in more detail. After transferring the project for FreeBSD to a multithreaded version, I didn’t like the final speed of work. The logic of the original project had to remain practically identical, and it was not possible to distribute the data among the streams so that, for example, one stream worked with only one structure, and did not climb anywhere else. Accordingly, the emphasis was on "fine synchronization" – a lot of locks and mutexes were released, which, in theory, should not be captured by someone for a long time. But now there are fears that the presence of a large number of calls to mutexes has negated the benefits of using multithreading. Please share your experience, what is worth paying attention to, and are there any other possibilities of fighting for speed besides the complete reworking of the project logic?

UPDATE

To avoid confusion, I will describe an approximate model of the project. In the system under consideration, there were 4 threads per 4 cores:

  • 1st – the main thread, serves new connections, handles interrupts, and performs non-urgent work during idle time;
  • 2nd – the flow of receiving and processing requests from users;
  • 3rd and 4th – threads that process different parts of the traffic.

All streams periodically read and write data into various structures (about 10, include various counters, arrays with indices) at non-deterministic moments in time relative to each other, as well as read and edit 4 different lists. The main idea of ​​the multithreaded version of the program was to separate users from traffic processing so that in case of high load they would not wait too long, as well as to increase the traffic processing speed.

Answer:

A bunch of mutexomes can slow down a program very well. Another program can be slowed down by a cache miss. For example, the program simply took two arrays and summed them up into a third. But the arrays are big (hundreds of millions of elements). After parallelization, 8 threads work with 8 arrays. And most likely 8*3 = 24 pieces of different memory will not fit into the cache so well, there will be constant loading of data into the cache, and unloading (since other threads also need it). In this case, a decrease in the number of threads can give a significant increase.

There are algorithms that are difficult to parallelize just like that. For example, calculation using recurrent formulas (when the next element depends on the previous one).

How do you improve the performance indicators of the program if it has not met the expectations assigned to it in this regard?

for a start in the classical way – profiling. Then, when suspicious places are found, a search for good algorithms. For example, sometimes bubble sort can be much better than quick sort – for example, it is known that data is almost always sorted and no more than 2-3 elements are out of place. quick sort in this case will spend a lot of memory, but in terms of speed it is unlikely to win.

What to do?

  • remove as much mutexes and locks as possible. For example, atomic can be used for different counters.
  • look at the algorithms, perhaps the use of sse will give an increase of 3-4 times. Although sometimes even turning on the compiler option can fix things.
  • revision of multithreading. Perhaps OpenMP can help a lot. It can be of great help if you need to parallelize small loops that have an "independent body".
  • disable debug output. How funny it is :). I had a case when, by removing one line of output to the log, the code accelerated tenfold. It's just that the logger was blocking with forced dumping to a file. The funny thing is that this line displayed the processing time of a block of code, but in fact it worked much longer than the code itself.
Scroll to Top