Question:
I am optimizing a project that does not meet its performance requirements. Fenwick trees are widely used in it to quickly obtain prefix sums (a.k.a. cumulative sums) over an array with constantly changing contents. The structure itself is very useful and helps a lot, but here, in addition to the standard operation of incrementing an arbitrary element, two nonstandard operations are used very often (I have not found analogs in the literature), and they account for a significant part of the computational load:

1. Obtaining the values (exporting them to an external array) of the prefix sums for the elements in the contiguous range from i to i + k.
2. Zeroing (or, say, filling with a constant) the elements in the contiguous range from i to i + k, with correct recalculation of all affected sums.

Important: for both operations, the average value of k is much less than n (the number of elements in the tree).
At the moment, these operations are implemented in the most straightforward way:

1. In a loop (separately!), the value of each element in the range is found (a standard Fenwick tree operation). Total complexity: O(k*log(n)) for k elements.
2. In a loop (separately!), the value of each element is found, after which the element is incremented by the negation of its current value (to reset it to zero). The complexity, again, is O(k*log(n)).
I have thought about the structure of the tree and these operations, and I believe both can be implemented with lower complexity. After all, log(n) is the depth of a descent from the root to an arbitrary element, while these operations traverse a group of consecutive elements. It should be possible to exploit this fact, since the nearest common node of the tree lies, at best, at a depth of log(k) from any of the end nodes of a contiguous range of length k. So, to traverse the entire range, one can descend from the root in log(n), and then move up and down from one neighboring element to the next in log(k). Total: the complexity of the algorithm should be O(log(n) + k*log(k)), which is much better than the current version. This is the best case, but the average estimate should also be close to it, since the worst case O(k*log(n)) is unlikely for k ≪ n.
Now I am stuck on the implementation of the algorithms for these two operations. So far it is not quite clear to me how to go from element i to element i+1 without returning to the root. In theory, one needs to ascend to the nearest common node and descend back down to the next element, accumulating along the way the sums required to obtain the value of the next element. And for operation (2), apparently, the values of the intermediate nodes must be corrected on the same pass, as in a standard increment (if it is possible to do both at once).
These are all general considerations for binary trees; how this should be implemented specifically for a Fenwick tree, I still have no idea.
I would be grateful for help with concrete algorithms, in pseudocode or any imperative language.
If my reasoning about the complexity is wrong, I would like to know where exactly I am mistaken.
Answer:
Yes, these operations can be sped up. To start, here is my implementation:
int next2n(int x) {
    // round x up to the next power of two (32-bit only)
    x = x - 1;
    x = x | (x >> 1);
    x = x | (x >> 2);
    x = x | (x >> 4);
    x = x | (x >> 8);
    x = x | (x >> 16);
    return x + 1;
}
class BIT_plus {
    std::vector<int> in;   // Fenwick tree array
    std::vector<int> num;  // original elements
    int n;
public:
    BIT_plus(int n) {
        init(n);
    }
    BIT_plus(const std::vector<int>& list) {
        init(list);
    }
    void init(const std::vector<int>& list) {
        init(list.size());
        for (auto i = 0u; i < list.size(); ++i) {
            inc(i, list[i]);
        }
    }
    // prefix sums for indices [start, start + len):
    // O(log n) for the first one, O(1) for each subsequent one
    std::vector<int> sum_dip(int start, int len) {
        int summ = sum(start);
        std::vector<int> res;
        for (int i = start; i < start + len; i++) {
            res.push_back(summ);
            if (i + 1 < n)         // guard against reading past the end
                summ += num[i + 1];
        }
        return res;
    }
    void init(int n) {
        in.assign(n, 0);
        num.assign(n, 0);
        this->n = n;
    }
    int size() {
        return n;
    }
    void inc(int i, int delta) {
        num[i] += delta;
        for (; i < n; i = (i | (i + 1)))
            in[i] += delta;
    }
    int sum(int r) const {
        int result = 0;
        for (; r >= 0; r = (r & (r + 1)) - 1)
            result += in[r];
        return result;
    }
    int sum(int l, int r) const {
        return sum(r) - sum(l - 1);
    }
    // fill [start, start + len) with val: indices below n2n are updated
    // per element, the common tail is updated once with the total delta
    void fill_const(int start, int len, int val) {
        int n2n = next2n(start + len + 1) - 1;
        int sumdelta = 0;
        for (int i = start; i < start + len; i++) {
            int delta = val - num[i];
            sumdelta += delta;
            for (int j = i; j < n2n; j = (j | (j + 1))) {
                in[j] += delta;
            }
            num[i] = val;
        }
        for (int i = n2n; i < n; i = (i | (i + 1))) {
            in[i] += sumdelta;
        }
    }
    auto get_num() const {
        return num;
    }
    auto get_in() const {
        return in;
    }
};
It is somewhat bloated with auxiliary methods, but it contains all the essentials. (I took the implementation from here as a basis.) Compared with the usual implementation, it has two new methods: fill_const (fills a range of the original sequence with a given number) and sum_dip (returns the prefix sums of the elements in a range).
The basic idea behind the implementation is quite simple: in addition to the Fenwick tree array itself (in), we also store the original elements themselves (num).
1. Finding all the prefix sums in the range [i, i+k] then works as follows: first we find the prefix sum for the i-th element, as in the normal implementation, at a cost of O(log(n)). Then, to find sum(i+1), we add the number num[i+1]. Indeed, if sum(i) is the sum of the first i elements, then sum(i) + num[i+1] is the sum of the first i+1 elements.
2. The second part, filling with a constant, is a little more complicated. First, let us understand which elements of the Fenwick array in we need to change. If we change the element with index i, then we must update the element with index i, then f(i), f(f(i)), and so on, where f(i) = i|(i+1) (| is bitwise OR). Note that if at some point we land on a number of the form 2^k - 1, then every subsequent index is also of the form 2^m - 1. Moreover, starting from any number i, we cannot jump over the next number of the form 2^k - 1 (otherwise, where would the one in the next digit come from? and our function is monotonically increasing). We use this as follows: for each j in [i, i+k) we update the cells by the standard rules (adding delta to them) as long as the index of the cell being updated is less than the first number of the form 2^k - 1 after i+k (obtained via next2n). From that point on, the array indices to be updated are the same for all j, and we update each of them once, by the sum of all the individual deltas.
The second algorithm can probably be optimized further (right now it seems to run in O(k*log(i) + log(N)), which is asymptotically no faster, since i may be of the order of n, but it is faster in practice, especially for small i); even in this form, however, it is faster than an independent update of each element.