Matrix multiplication with iterator dependency - NumPy

Sometime back this question (now deleted but 10K+ rep users can still view it) was posted. It looked interesting to me and I learnt something new there while trying to solve it and I thought that's worth sharing. I would like to post those ideas/solutions and would love to see people post other possible ways to solve it. I am posting the gist of the question next.

So, we have two NumPy ndarrays a and b of shapes :

```a : (m,n,N)
b : (n,m,N)
```

Let's assume we are dealing with cases where m,n & N are comparable.

The problem is to solve the following multiplication and summation with focus on performance :

```def all_loopy(a,b):
P,Q,N = a.shape
d = np.zeros(N)
for i in range(N):
for j in range(i):
for k in range(P):
for n in range(Q):
d[i] += a[k,n,i] * b[n,k,j]
return d
```

I learnt few things along the way trying to find vectorized and faster ways to solve it.

1) First off, there is a dependency of iterators at "for j in range(i)". From my previous experience, especially with trying to solve such problems on MATLAB, it appeared that such dependency could be taken care of with a lower triangular matrix, so np.tril should work there. Thus, a fully vectorized solution and not so memory efficient solution (as it creates an intermediate (N,N) shaped array before finally reducing to (N,) shaped array) would be -

```def fully_vectorized(a,b):
return np.tril(np.einsum('ijk,jil->kl',a,b),-1).sum(1)
```

2) Next trick/idea was to keep one loop for the iterator i in for i in range(N), but insert that dependency with indexing and using np.einsum to perform all those multiplications and summations. The advantage would be memory-efficiency. The implementation would look like this -

```def einsum_oneloop(a,b):
d = np.zeros(N)
for i in range(N):
d[i] = np.einsum('ij,jik->',a[:,:,i],b[:,:,np.arange(i)])
return d
```

There are two more obvious ways to solve it. So, if we start working from the original all_loopy solution, one could keep the outer two loops and use np.einsum or np.tensordot to perform those operations and thus remove the inner two loops, like so -

```def tensordot_twoloop(a,b):
d = np.zeros(N)
for i in range(N):
for j in range(i):
d[i] += np.tensordot(a[:,:,i],b[:,:,j], axes=([1,0],[0,1]))
return d

def einsum_twoloop(a,b):
d = np.zeros(N)
for i in range(N):
for j in range(i):
d[i] += np.einsum('ij,ji->',a[:,:,i],b[:,:,j])
return d
```

Runtime test

Let's compare all the five approaches posted thus far to solve the problem, including the one posted in the question.

Case #1 :

```In [26]: # Input arrays with random elements
...: m,n,N = 20,20,20
...: a = np.random.rand(m,n,N)
...: b = np.random.rand(n,m,N)
...:

In [27]: %timeit all_loopy(a,b)
...: %timeit tensordot_twoloop(a,b)
...: %timeit einsum_twoloop(a,b)
...: %timeit einsum_oneloop(a,b)
...: %timeit fully_vectorized(a,b)
...:
10 loops, best of 3: 79.6 ms per loop
100 loops, best of 3: 4.97 ms per loop
1000 loops, best of 3: 1.66 ms per loop
1000 loops, best of 3: 585 µs per loop
1000 loops, best of 3: 684 µs per loop
```

Case #2 :

```In [28]: # Input arrays with random elements
...: m,n,N = 50,50,50
...: a = np.random.rand(m,n,N)
...: b = np.random.rand(n,m,N)
...:

In [29]: %timeit all_loopy(a,b)
...: %timeit tensordot_twoloop(a,b)
...: %timeit einsum_twoloop(a,b)
...: %timeit einsum_oneloop(a,b)
...: %timeit fully_vectorized(a,b)
...:
1 loops, best of 3: 3.1 s per loop
10 loops, best of 3: 54.1 ms per loop
10 loops, best of 3: 26.2 ms per loop
10 loops, best of 3: 27 ms per loop
10 loops, best of 3: 23.3 ms per loop
```

Case #3 (Leaving out all_loopy for being very slow) :

```In [30]: # Input arrays with random elements
...: m,n,N = 100,100,100
...: a = np.random.rand(m,n,N)
...: b = np.random.rand(n,m,N)
...:

In [31]: %timeit tensordot_twoloop(a,b)
...: %timeit einsum_twoloop(a,b)
...: %timeit einsum_oneloop(a,b)
...: %timeit fully_vectorized(a,b)
...:
1 loops, best of 3: 1.08 s per loop
1 loops, best of 3: 744 ms per loop
1 loops, best of 3: 568 ms per loop
1 loops, best of 3: 866 ms per loop
```

Going by the numbers, einsum_oneloop looks pretty good to me, whereas fully_vectorized could be used when dealing with small to decent sized arrays!

I'm not sure if you want it to be all-numpy but I've always used numba for slow but easy to implement loop-based functions. The speedup for loopy-intensive tasks is amazing. First I just numba.njitted your all_loopy variant which already gave me comperative results:

```m,n,N = 20,20,20
a = np.random.rand(m,n,N)
b = np.random.rand(n,m,N)

%timeit numba_all_loopy(a,b)
1000 loops, best of 3: 476 µs per loop # 3 times faster than everything else
%timeit tensordot_twoloop(a,b)
100 loops, best of 3: 16.1 ms per loop
%timeit einsum_twoloop(a,b)
100 loops, best of 3: 4.02 ms per loop
%timeit einsum_oneloop(a,b)
1000 loops, best of 3: 1.52 ms per loop
%timeit fully_vectorized(a,b)
1000 loops, best of 3: 1.67 ms per loop
```

Then I tested it against your 100, 100, 100 case:

```m,n,N = 100,100,100
a = np.random.rand(m,n,N)
b = np.random.rand(n,m,N)

%timeit numba_all_loopy(a,b)
1 loop, best of 3: 2.35 s per loop
%timeit tensordot_twoloop(a,b)
1 loop, best of 3: 3.54 s per loop
%timeit einsum_twoloop(a,b)
1 loop, best of 3: 2.58 s per loop
%timeit einsum_oneloop(a,b)
1 loop, best of 3: 2.71 s per loop
%timeit fully_vectorized(a,b)
1 loop, best of 3: 1.08 s per loop
```

Apart from noticing that my computer is much slower than yours - the numba version is getting slower. What happened?

Numpy uses compiled versions and depending on the compiler options numpy will probably optimize the looping while numba doesn't. So the next logical step is the optimize the looping. Assuming C-contiguous arrays the innermost loops should be the last axis of the arrays. It's the fastest changing axis so the cache locality will be better.

```@nb.njit
def numba_all_loopy2(a,b):
P,Q,N = a.shape
d = np.zeros(N)
# First axis a, second axis b
for k in range(P):
# first axis b, second axis a
for n in range(Q):
# third axis a
for i in range(N):
# third axis b
A = a[k,n,i] # so we have less lookups of the same variable
for j in range(i):
d[i] += A * b[n,k,j]
return d
```

so what are the timings of this "optimized" numba function? Can it compare with the others or even beat them?

```m = n = N = 20
%timeit numba_all_loopy(a,b)
1000 loops, best of 3: 476 µs per loop
%timeit numba_all_loopy2(a,b)
1000 loops, best of 3: 379 µs per loop # New one is a bit faster
```

so it's slightly faster for small matrices, what about big ones:

```m = n = N = 100
%timeit numba_all_loopy(a,b)
1 loop, best of 3: 2.34 s per loop
%timeit numba_all_loopy2(a,b)
1 loop, best of 3: 213 ms per loop # More then ten times faster now!
```

So we have an amazing speedup for large arrays. This function is now 4-5 times faster than your vectorized approaches and the result is the same.

But amazingly it seems that the ordering seems somehow dependant on the computer because the fully_vectorized is fastest where the einsum-options are faster on @Divakar's machine. So it might be open if these results are really that much faster.

Just for fun I tried it with n=m=N=200:

```%timeit numba_all_loopy2(a,b)
1 loop, best of 3: 3.38 s per loop  # still 5 times faster
%timeit einsum_oneloop(a,b)
1 loop, best of 3: 51.4 s per loop
%timeit fully_vectorized(a,b)
1 loop, best of 3: 16.7 s per loop
```