2021-02-08

# Introduction

I have been trying out Julia as a replacement for Octave. One reason is that several of my collaborators (thanks Emilie, Aurélien and Rémy) are fans. I had heard great things about its speed and its natural fit for the type of one-page programs one tends to write to try out research ideas.

In one of my projects, profiling revealed a bottleneck: the computation of the dot product between two vectors. In this project the vectors are of fixed and rather small length. Dimensions between $$3$$ and $$10$$ are common, and $$100$$ is the exception. I had written the dot product between vectors $$x$$ and $$y$$ as follows:

    sum(x .* y)

To me this looks natural. One sums the entry-wise products. How could one do this any faster?
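One hint that it can be beaten: `x .* y` materializes a temporary array before summing. A quick check with `@allocated`, using only Base, makes this visible:

```julia
sumdot(x, y) = sum(x .* y)

x, y = randn(3), randn(3)
sumdot(x, y)                      # warm up, so JIT compilation is excluded
bytes = @allocated sumdot(x, y)   # bytes allocated by one call
println(bytes)                    # nonzero: the broadcast builds a temporary array
```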

## Lessons

In this post we dig into accelerating the dot product, and learn some surprising lessons about Julia.

• The natural implementation is not the fastest.

• Finding the fastest implementation is not straightforward.

• The run-times of different implementations of the dot product are strikingly diverse.

## Method

I coded 14 different versions of the dot product. Each version is evaluated on lengths $$1$$ to $$2000$$. I also evaluate each version on four types of inputs:

• Array [1,2,3].

• Tuple (1,2,3).

• Range 1:3.

• Generator (x^2 for x in 1:3).

Many Julia commands return arrays, like randn(5,3). The main difference (in the context of this post) between a Range and a Generator is that the former allows indexing, i.e. (1:5)[3] while the latter only allows iteration (see an interesting read on trying to add indexing to generators).
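To make the distinction concrete, a range supports `getindex` while a generator does not:

```julia
r = 1:5                              # a Range: both indexable and iterable
g = (x^2 for x in 1:5)               # a Generator: iterable only

println(r[3])                        # indexing works on a range: prints 3
println(sum(g))                      # iteration works on a generator: prints 55
println(applicable(getindex, g, 3))  # prints false: generators have no getindex method
```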

# Results

The implementations are listed below, followed by the results. The measurements are performed on my Debian laptop, using Julia version 1.5.3 on an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz.
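The exact benchmarking harness is not shown in this post; a minimal stand-in using only Base (the actual measurements likely used something more careful, such as BenchmarkTools.jl) could look like:

```julia
# Minimal timing/allocation harness. This is an illustrative sketch,
# not the harness used for the measurements reported below.
function measure(f, x, y; samples=1_000)
    f(x, y)                                    # warm up: exclude JIT compilation
    t = minimum(@elapsed(f(x, y)) for _ in 1:samples)
    b = @allocated f(x, y)                     # bytes allocated by one call
    return (time = t, bytes = b)
end

sumdot(x, y) = sum(x .* y)
x, y = randn(3), randn(3)
res = measure(sumdot, x, y)
println(res)
```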

## Length 3

Here is the run-time for vectors of length $$3$$.

Computing the dot product should not require allocating memory, yet allocation could explain some of the high run-times. Here are the total memory allocations, in bytes, caused by each implementation.

|             | Array  | Range  | Tuple   | Generator |
|-------------|-------:|-------:|--------:|----------:|
| builtin_dot | 0      | 0      | 0       | 0         |
| sumdot      | 112    | 112    | 0       | 336       |
| sumzip      | 0      | 0      | 0       | 0         |
| msmzip      | 0      | 0      | 0       | 0         |
| sumix       | 0      | 0      | 0       | 4.8e-5    |
| msmix       | 0      | 0      | 0       | 4.8e-5    |
| blas        | 0      | 0      | 2.9e-5  | 4.8e-5    |
| forea       | 240    | 368    | 128     | 368       |
| iter        | 0      | 0      | 0       | 0         |
| accixz      | 0      | 0      | 0       | 4.8e-5    |
| acczip      | 0      | 0      | 0       | 0         |
| accix       | 0      | 0      | 0       | 4.8e-5    |
| mapred      | 0      | 0      | 0       | 0         |
| lazysumdot  | 0      | 0      | 0       | 224       |

What is going on with the very small numbers? These are small allocations happening only once, and not once per dot product ($$2.9 \cdot 10^{-5}$$ maps to $$96$$ and $$4.8 \cdot 10^{-5}$$ maps to $$160$$). Julia is compiled Just-In-Time, and I do make sure to run each function once before starting the measurement. So it is not JIT overhead. I don’t currently have an explanation. Curious!

One thing to notice is how much the rather natural sumdot underwhelms.

## Length 2000

Here is the run-time for vectors of length $$2000$$.

And here are the memory allocation details.

|             | Array  | Range  | Tuple  | Generator |
|-------------|-------:|-------:|-------:|----------:|
| builtin_dot | 0      | 0      | 0      | 0         |
| sumdot      | 16128  | 16128  | 144448 | 48384     |
| sumzip      | 0      | 0      | 0      | 0         |
| msmzip      | 0      | 0      | 0      | 0         |
| sumix       | 0      | 0      | 0      | 0.032     |
| msmix       | 0      | 0      | 0      | 0.032     |
| blas        | 0      | 0      | 6.4    | 0.032     |
| forea       | 128048 | 128176 | 64032  | 128176    |
| iter        | 0      | 0      | 0      | 0         |
| accixz      | 0      | 0      | 0      | 0.032     |
| acczip      | 0      | 0      | 0      | 0         |
| accix       | 0      | 0      | 0      | 0.032     |
| mapred      | 0      | 0      | 0      | 0         |
| lazysumdot  | 0      | 0      | 0      | 32256     |

## Scaling

This is how each method scales with the input length. We see that for large inputs, BLAS and dot are the winners, closely followed by msmix and lazysumdot.

# Implementations

## builtin_dot

The built-in function dot from the LinearAlgebra standard library, also writable as the infix operator ⋅ (typed as \cdot followed by tab, and rendered as $$\cdot$$).

    using LinearAlgebra

    function builtin_dot(x, y)
        x ⋅ y
    end

## sumdot

The most straightforward implementation.

    function sumdot(x, y)
        sum(x .* y)
    end

## sumzip

Using iterator zipping.

    function sumzip(x, y)
        sum(v * w for (v, w) in zip(x, y))
    end

## msmzip

Using iterator zipping, combined with the ability of sum to apply a function.

    function msmzip(x, y)
        # `((v,w),)` destructures the tuple yielded by zip into v and w
        sum(((v, w),) -> v * w, zip(x, y))
    end

## sumix

Iterate over the indices explicitly.

    function sumix(x, y)
        sum(@inbounds(x[k] * y[k]) for k in eachindex(x))
    end

## msmix

Map the pairwise product function over indices.

    function msmix(x, y)
        sum(k -> @inbounds(x[k] * y[k]), eachindex(x))
    end

## blas

Defer to BLAS (the Basic Linear Algebra Subprograms) by writing the product as `x'y`.

    function blas(x, y)
        x'y
    end

## iter

Manual dot product using two iterators.

    function iter(x, y) # arbitrary iterables
        ix = iterate(x)
        iy = iterate(y)
        s = 0.0
        while ix !== nothing && iy !== nothing
            (vx, xs), (vy, ys) = ix, iy   # each is a (value, state) pair
            s += vx * vy
            ix = iterate(x, xs)
            iy = iterate(y, ys)
        end
        if !(iy === nothing && ix === nothing)
            throw(DimensionMismatch("x and y are of different lengths!"))
        end
        return s
    end

## accixz

Manual dot product using zipped indices.

    function accixz(x, y)
        lx = length(x)
        if lx != length(y)
            throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
        end
        s = 0.0
        for (Ix, Iy) in zip(eachindex(x), eachindex(y))
            @inbounds s += x[Ix] * y[Iy]
        end
        s
    end

## accix

Manual dot product using explicit indices.

    function accix(x, y)
        lx = length(x)
        if lx != length(y)
            throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
        end
        s = 0.0
        for k in 1:lx
            @inbounds s += x[k] * y[k]
        end
        s
    end

## acczip

Manual dot product using zipped collections.

    function acczip(x, y)
        lx = length(x)
        if lx != length(y)
            throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
        end
        s = 0.0
        for (v, w) in zip(x, y)
            s += v * w
        end
        s
    end

## forea

Manual dot product using the foreach function.

    function forea(x, y)
        s = 0.0
        # the closure captures (and boxes) s, which is why this version allocates
        foreach((v, w) -> s += v * w, x, y)
        s
    end

## mapred

Dot product using mapreduce.

    function mapred(x, y)
        # destructure each zipped pair, multiply, and reduce with +
        mapreduce(((a, b),) -> a * b, +, zip(x, y))
    end

## lazysumdot

Dot product using sum, with the non-materializing broadcast macro `@~` from the LazyArrays package.

    using LazyArrays

    function lazysumdot(x, y)
        sum(@~ x .* y)
    end

# Thoughts

Many follow-up questions arise.

• What of mixed argument types, say inner product between an Array and a Tuple?

• From a compiler perspective, why is this difficult? Why do these implementations, when all abstractions are instantiated, not compile to the same machine code?

• What of element-wise products between three vectors? We can do

    sum(x .* y .* z)

but now there are no built-in BLAS calls. How to get these fast?

• One thing I have not investigated is whether all implementations have the same accuracy. I would believe yes: all implementations accumulate the element-wise products in order. But one might imagine to try and maximize precision, for example by computing the dot product in a carefully considered order (to process cancellations between positive and negative large terms before adding small terms). This would likely require sorting the entries, and that would stand out in the run-time.
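On the first question, the zip-based variants above already accept mixed argument types, and the same trick extends to three vectors without materializing a temporary (the helper name `tripledot` below is ad hoc, not a library function):

```julia
sumzip(x, y) = sum(v * w for (v, w) in zip(x, y))

a = [1.0, 2.0, 3.0]              # Array
t = (4, 5, 6)                    # Tuple
println(sumzip(a, t))            # mixed Array/Tuple dot product: prints 32.0

# Elementwise product of three vectors, with no temporary array:
tripledot(x, y, z) = sum(u * v * w for (u, v, w) in zip(x, y, z))
println(tripledot(a, a, a))      # 1 + 8 + 27: prints 36.0
```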

What we may need is a lazy map, or a non-materialising broadcast operation. The Julia discussion on this topic is public, see for example