highload – How do you measure the performance of your code in production?

Question:

I have a backend that serves a lot of web applications. I would like to achieve two goals: 1. Reduce page generation time, so that Google and users love our site. 2. Reduce CPU consumption, so that fewer backend servers are needed.

So, when introducing various optimizations in the code, we usually take measurements. Most often we simply measure the average time required to process one HTTP request.

We average over a whole day, because waiting any longer takes too much time. Ideally I would like to avoid waiting even a day: watch for 5 minutes, collect the data, calculate the average, and conclude whether performance has improved or worsened.
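
For illustration, here is a minimal sketch of that idea, assuming a hypothetical log with one request duration in seconds per line (a real log format will of course differ); it reads a slice of the log on stdin and prints the number of requests, the average, and, as a common sanity check, the 95th percentile:

#!/usr/bin/perl
# Sketch: aggregate per-request durations (one number per line on stdin).
use strict;
use warnings;

my ($sum, @times) = (0);
while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ /^\d+(?:\.\d+)?$/;   # keep only numeric durations
    push @times, $line;
    $sum += $line;
}
die "no samples\n" unless @times;

@times = sort { $a <=> $b } @times;
printf "requests: %d  avg: %.4f s  p95: %.4f s\n",
    scalar @times, $sum / @times, $times[int(0.95 * $#times)];

Feed it the last 5 minutes of your access log (however you slice it) and compare the numbers before and after a change.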

The trouble is that the execution time of the algorithm in the abstract also depends on the load on the server. That is, you can change something in the code, roll it out to production, and see the numbers move in the opposite direction from how the efficiency of the algorithm actually changed, simply because at this hour or on this day of the week users happen to send many requests.

I tried to measure only the user CPU time (the way the time utility measures it). But when the server is not 100% loaded, this metric is also sensitive to server load (a half-idle server runs almost 2 times faster than a 100% loaded one).

By 100% load I mean running as many single-threaded processes that eat up all the available processor time as there are physical cores on the server. That is, where hyperthreading is enabled, you need to start half as many such processes as the number of visible cores, and where it is disabled, exactly as many as the number of visible cores.

This is how I measured:

#!/usr/bin/perl

use strict;
use warnings;
use Time::HiRes qw(time);

# Number of fork() calls; the script ends up with 2 ** $forks processes.
my $forks = shift;

# Detect logical CPUs and threads per core from lscpu
# (\d+ so that multi-digit CPU counts are parsed as well).
my $lscpu = `lscpu`;
my($cpus) = $lscpu =~ /^CPU\(s\):\s+(\d+)$/m;
my($threadsPerCore) = $lscpu =~ /^Thread\(s\) per core:\s+(\d+)$/m;
my $cores = $cpus / $threadsPerCore;

# Pure CPU-bound work: no IO, no memory allocation.
sub load {
    my $a = 0;
    $a += rand() foreach(0 .. 100000000);
}

# Each iteration forks every existing process, so after the loop
# there are 2 ** $forks identical workers, each running load().
fork() for (1 .. $forks);

my $u = - times();   # user CPU time before the load
my $t = - time();    # wall-clock time before the load
load();
$u += times();       # user CPU time spent in load()
$t += time();        # wall-clock time spent in load()

# Every process prints its own row: CPUs, cores, process count, wall time, user time.
printf "| %d | %d | %d | %.2f | %.2f |\n", $cpus, $cores, 2 ** $forks, $t, $u;

And here are my measurements on a machine with hyperthreading:

|----|-----|-----|-------|---------|
|CPUs|Cores|Procs| Time  |User time|
|----|-----|-----|-------|---------|
|  4 |  2  |  1  | 11.08 |  11.07  |
|  4 |  2  |  2  | 11.70 |  11.69  |
|  4 |  2  |  4  | 19.79 |  19.64  |
|  4 |  2  |  8  | 39.42 |  19.62  |
|  4 |  2  |  16 | 83.36 |  19.86  |
|----|-----|-----|-------|---------|

And on a machine without hyperthreading:

|----|-----|-----|-------|---------|
|CPUs|Cores|Procs| Time  |User time|
|----|-----|-----|-------|---------|
|  2 |  2  |  1  | 23.74 |  23.73  |
|  2 |  2  |  2  | 23.53 |  23.52  |
|  2 |  2  |  4  | 46.78 |  23.38  |
|  2 |  2  |  8  | 93.76 |  23.43  |
|----|-----|-----|-------|---------|

And on this machine the user time is about the same everywhere! But what is wrong with the first machine? As soon as I load it with more processes than it has physical cores, the user time increases. What is this? The magic of hyperthreading? But htop shows that during the test only one of the 4 virtual cores was loaded.

UPD: I ran it on one more machine with hyperthreading:

|----|-----|-----|-------|---------|
|CPUs|Cores|Procs| Time  |User time|
|----|-----|-----|-------|---------|
|  8 |  4  |  1  | 6.23  |  6.18   |
|  8 |  4  |  2  | 6.20  |  6.16   |
|  8 |  4  |  4  | 8.38  |  8.33   |
|  8 |  4  |  8  | 19.95 |  11.90  |
|  8 |  4  |  16 | 33.71 |  11.98  |
|----|-----|-----|-------|---------|

Here the user time keeps growing until all 8 virtual cores are loaded, not just the physical ones.

Answer:

This is a multi-level task; you need different approaches for different kinds of optimization.

Sampling

The idea of sampling: we periodically take a stack dump of the process and then run statistical analysis over the dumps. This approach most clearly shows the methods that run the longest.

In that output you should look for the parts of your code that need to be sped up.
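
As a rough illustration of the idea (not a replacement for ready-made profilers such as perf for native code or Devel::NYTProf for Perl), here is an in-process "poor man's" sampler: a SIGPROF timer fires every 10 ms of CPU time, the handler records the current call stack, and the stacks that show up most often are the hottest ones:

#!/usr/bin/perl
# Sketch of a sampling profiler: count how often each call stack is seen.
use strict;
use warnings;
use Time::HiRes qw(setitimer ITIMER_PROF);

my %samples;   # "frame1;frame2;..." => number of times this stack was seen

# On every SIGPROF tick, record the current call stack.
$SIG{PROF} = sub {
    my @stack;
    for (my $i = 1; my @frame = caller($i); $i++) {
        push @stack, $frame[3];            # fully qualified subroutine name
    }
    $samples{ join(';', reverse @stack) || 'main' }++;
};

setitimer(ITIMER_PROF, 0.01, 0.01);        # sample every 10 ms of CPU time

# ... the code under test goes here; a toy load for demonstration:
sub busy_work { my $x = 0; $x += rand() for 1 .. 3_000_000; return $x }
busy_work() for 1 .. 5;

setitimer(ITIMER_PROF, 0, 0);              # stop sampling

# The hottest call stacks (the ones hit most often) come out on top.
for my $stack (sort { $samples{$b} <=> $samples{$a} } keys %samples) {
    printf "%6d  %s\n", $samples{$stack}, $stack;
}

The same idea applied from outside the process (periodic stack dumps with gdb, jstack and the like) works for long-running production services as well.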

Thread Context Switch Rate

If you have synchronous code, then IO operations cause frequent context switches: the thread tells the OS "I am going to wait, I have nothing to do", and the OS switches the core to another thread. This wastes CPU. In this case the processor time is consumed by the operating system, not by your application, so the performance drop is spread out and hard to attribute.

See here for a guide on how to verify that your application is affected by this issue. It can be cured by increasing the share of non-blocking operations (i.e., anything that potentially makes us wait on IO is our problem).
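
For a quick check on Linux, here is a rough sketch (tools like pidstat -w from sysstat report the same per-process numbers): watch how fast the voluntary and involuntary context-switch counters of a process grow.

#!/usr/bin/perl
# Sketch: sample context-switch counters from /proc/<pid>/status (Linux only).
# Note: the counters are per thread; for a multi-threaded process look at
# /proc/<pid>/task/<tid>/status for each thread.
use strict;
use warnings;

my $pid = shift or die "usage: $0 <pid>\n";

sub ctxt_switches {
    open my $fh, '<', "/proc/$pid/status"
        or die "cannot read /proc/$pid/status: $!\n";
    my %s;
    while (<$fh>) {
        $s{$1} = $2 if /^(voluntary_ctxt_switches|nonvoluntary_ctxt_switches):\s+(\d+)/;
    }
    return @s{qw(voluntary_ctxt_switches nonvoluntary_ctxt_switches)};
}

my $interval = 5;
my ($v0, $nv0) = ctxt_switches();
sleep $interval;
my ($v1, $nv1) = ctxt_switches();

printf "voluntary: %.1f/s  involuntary: %.1f/s\n",
    ($v1 - $v0) / $interval, ($nv1 - $nv0) / $interval;

A high voluntary rate usually means the threads spend their time waiting on blocking IO rather than doing useful work.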

Memory traffic

If the application allocates a lot of memory, this creates additional load on the GC. Here, as in the previous point, it is important to compare against a reference value. For example, for Java / .Net it is generally accepted that GC should not take more than a few percent of the total program runtime.

This can be checked in production (the Sampling step will tell you that, say, GC took 5% of the time), but the problem has to be treated locally, by finding where the most memory is allocated.

Important: if a lot of memory is allocated but the GC time stays short, there is no problem.

Upper Level

The points above apply to any software. However, sometimes the slowdown is in the calling server, for example when the number of active threads is smaller than the number of parallel requests (then the user waits even though your application appears to run fast). There are no general solutions for such cases; it all depends on the specific environment.
