python – Why does 2*i*i tend to be faster than 2*(i*i) when i is an integer?

Question:

The two expressions, 2*i*i and 2*(i*i), are equivalent and must yield the same result; the only thing that changes is the order in which the multiplications are performed, yet they are apparently treated differently by the interpreter.

In the first expression, given the absence of parentheses and considering that both operations have the same precedence, the interpreter executes from left to right; that is, 2*i is computed first and the result is multiplied by i.

In the second expression, the parentheses change the order of execution: the multiplication i*i takes precedence over the multiplication by 2, so the interpreter first computes i*i and then multiplies the result by 2.
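This grouping can be seen directly in the parse tree; a quick check with the standard library ast module (the exact dump format varies slightly between Python versions):

import ast

# 2*i*i parses as (2*i)*i: the inner BinOp is the *left* operand
print(ast.dump(ast.parse("2*i*i", mode="eval")))

# 2*(i*i) parses with i*i as the *right* operand of the outer BinOp
print(ast.dump(ast.parse("2*(i*i)", mode="eval")))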

Mathematically, the grouping of the factors does not change the product, so the result must be the same, but using the standard library timeit module it is possible to see a difference in execution time between the two expressions:

>>> from timeit import timeit

>>> print('2*i*i:', timeit("for i in range(1000): 2*i*i"))
2*i*i: 276.91411599100684

>>> print('2*(i*i):', timeit("for i in range(1000): 2*(i*i)"))
2*(i*i): 279.21798045100877

Tests were done on Repl.it

Note: by default, the timeit function runs the indicated code snippet 1,000,000 times, and the exact execution time may vary due to processor and machine fluctuations.
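The repetition count can also be set explicitly; timeit accepts it through the number parameter (shown below with its default value, just to make it visible):

from timeit import timeit

# number=1000000 is timeit's default; passing it explicitly for clarity
total_seconds = timeit("for i in range(1000): 2*i*i", number=1000000)
print(total_seconds)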

Why is there such a time difference? Does the presence of parentheses in the expression alter the executed algorithm?

Note that the analysis should focus on the official Python implementation, CPython, version 3+. Comparisons with other versions or implementations are welcome as a way to improve the answer.

Answer:

"You already know, but it never hurts to repeat" if you think you need optimizations at this level in a snippet of code in Python, you're writing that snippet in the wrong language (or any other very high-level language like ruby, php, and even Java).

That said, there are some things about your question that can be answered, and a little room for speculation, even without reaching a definitive answer.

First – when you want to understand differences like this in Python, it is worth looking at the bytecode generated for each piece of code. Comparing the bytecode makes it easier to speculate about what the differences might be. In Python, the dis function of the module of the same name lets you see the bytecode:

In [54]: def a(): 
    ...:     return 2 * i * i 
    ...:                                                                                                 

In [55]: def b(): 
    ...:     return 2 * (i * i) 
    ...:                                                                                                 

In [56]: import dis                                                                                      

In [57]: dis.dis(a)                                                                                      
  2           0 LOAD_CONST               1 (2)
              2 LOAD_GLOBAL              0 (i)
              4 BINARY_MULTIPLY
              6 LOAD_GLOBAL              0 (i)
              8 BINARY_MULTIPLY
             10 RETURN_VALUE

In [58]: dis.dis(b)                                                                                      
  2           0 LOAD_CONST               1 (2)
              2 LOAD_GLOBAL              0 (i)
              4 LOAD_GLOBAL              0 (i)
              6 BINARY_MULTIPLY
              8 BINARY_MULTIPLY
             10 RETURN_VALUE

So as you can see, there is actually a slight difference: in the first case, the bytecode alternates loading the values to be multiplied onto the Python virtual machine's stack with the calls to the multiplication itself. In the second case, it pushes all the values onto the virtual machine's stack and then calls the multiplication operator twice in a row. It may be that optimizers at the CPU microcode level can take advantage of this repeated, back-to-back call to the function behind the BINARY_MULTIPLY opcode (for example, the second call might get more hits in the CPU's branch predictor).
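As a side note, you do not even need to wrap the expressions in functions to compare them. A variation of the same check, disassembling the compiled expressions directly (in eval mode the names show up as LOAD_NAME rather than LOAD_GLOBAL, and on Python 3.11+ the opcode appears as BINARY_OP instead of BINARY_MULTIPLY, but the ordering difference is the same):

import dis

# Disassemble each expression in isolation; note where the multiply
# opcodes fall relative to the loads in each case
dis.dis(compile("2*i*i", "<expr>", "eval"))
dis.dis(compile("2*(i*i)", "<expr>", "eval"))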

Anyway, even if it is not exactly that, I would still bet my chips that what happens is some optimization at the CPU microcode level that gets activated in the second case.

This is exactly the kind of thing you will rarely worry about even when coding in C, let alone Python. And even then, before saying "wow, let me use inline assembler", the right move would be to look for high-level solutions that use the computer's resources more appropriately – code that uses the GPU or the vector processing units, for example, which would give a gain orders of magnitude greater than micro-optimizing a single operation. (And in the case of Python, the "first" remedy will always be to use NumPy.)
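For completeness, this is what the vectorized route looks like for the toy loop in the question – a minimal sketch, assuming NumPy is installed:

import numpy as np

i = np.arange(1000)
# One vectorized expression over all 1000 values at once,
# instead of a Python-level loop doing one multiplication at a time
result = 2 * i * i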

To rule out any remaining doubt about whether something inside the Python code path itself is responsible, one can also check the C code that is called for BINARY_MULTIPLY – which certainly has optimizations for when both operands are Python ints, but nothing beyond that (there is no extra optimization for when one of the operands is 2, for example; the runtime still has to call the code in int.__mul__ for that). That is, I am talking about optimizations that would sit between the VM finding the opcode and the call to int.__mul__.
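A quick sanity check of that last point, showing that both spellings bottom out in the same int.__mul__ calls and agree on the result:

i = 10

# (2*i)*i spelled out as explicit int.__mul__ calls
assert (2).__mul__(i).__mul__(i) == 2 * i * i == 2 * (i * i) == 200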

By the way, the difference really is so small that other incidental changes can shift things around – see what happens when I measure the times for the functions a and b above:

In [59]: i = 10                                                                                          

In [60]: %timeit a()                                                                                     
119 ns ± 1.71 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [61]: %timeit b()                                                                                     
123 ns ± 0.725 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

That is, on this machine, the form 2 * i * i is faster!
