2021-02-08
I have been trying out Julia as a replacement for Octave. One reason is that several of my collaborators (thanks Emilie, Aurélien and Rémy) are fans. I had heard great things about its speed and its natural fit for the kind of one-page programs one writes to try out research ideas.
In one of my projects, profiling revealed a bottleneck in the computation of the dot product between two vectors. In this project the vectors have fixed and rather small lengths: dimensions between \(3\) and \(10\) are common, and \(100\) is the exception. I had written the dot product between vectors \(x\) and \(y\) as follows:
```julia
sum(x .* y)
```
To me this looks natural. One sums the entry-wise products. How could one do this any faster?
In this post we dig into accelerating the dot product, and learn some surprising lessons about Julia:

- The natural implementation is not the fastest.
- Finding the fastest implementation is not straightforward.
- The run-times of different implementations of the dot product are strikingly diverse.
I coded 14 different versions of the dot product. Each version is evaluated on lengths \(1\)–\(2000\). I also evaluate each version on four types of inputs:
- Array: `[1,2,3]`
- Tuple: `(1,2,3)`
- Range: `1:3`
- Generator: `(x^2 for x in 1:3)`
Many Julia commands return arrays, like `randn(5,3)`. The main difference (in the context of this post) between a Range and a Generator is that the former allows indexing, e.g. `(1:5)[3]`, while the latter only allows iteration (see an interesting read on trying to add indexing to generators).
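To make these four kinds concrete, here is a small illustration (the variable names are mine, for illustration only) of how each is built and of the indexing difference:

```julia
a = [1, 2, 3]             # Array: values stored in memory, mutable, indexable
t = (1, 2, 3)             # Tuple: immutable, length known to the compiler
r = 1:3                   # Range: indexable, but only the endpoints are stored
g = (x^2 for x in 1:3)    # Generator: values are computed lazily, on demand

r[3]          # == 3: ranges support indexing
# g[3]        # MethodError: generators only support iteration
collect(g)    # == [1, 4, 9]: iterating the generator computes its values
```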
The implementations are listed below, followed by the results. The measurements are performed on my Debian laptop, running Julia version 1.5.3 on an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz.
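My exact measurement harness is not reproduced here; a minimal version, assuming the BenchmarkTools package, could look as follows, applied to each of the implementations listed below:

```julia
using BenchmarkTools

x = randn(3);
y = randn(3);

# Interpolating the arguments with $ keeps the global-variable lookup
# out of the measured time.
@btime sumdot($x, $y);
@btime builtin_dot($x, $y);
```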
Here is the run-time for vectors of length \(3\).
Computing the dot product should not require allocating memory, yet allocation could explain some of the high run-times. Here are the memory allocations in bytes, per dot product call, caused by each implementation.
|             | Array  | Range  | Tuple  | Generator |
|-------------|--------|--------|--------|-----------|
| builtin_dot | 0      | 0      | 0      | 0         |
| sumdot      | 112    | 112    | 0      | 336       |
| sumzip      | 0      | 0      | 0      | 0         |
| msmzip      | 0      | 0      | 0      | 0         |
| sumix       | 0      | 0      | 0      | 4.8e-5    |
| msmix       | 0      | 0      | 0      | 4.8e-5    |
| blas        | 0      | 0      | 2.9e-5 | 4.8e-5    |
| forea       | 240    | 368    | 128    | 368       |
| iter        | 0      | 0      | 0      | 0         |
| accixz      | 0      | 0      | 0      | 4.8e-5    |
| acczip      | 0      | 0      | 0      | 0         |
| accix       | 0      | 0      | 0      | 4.8e-5    |
| mapred      | 0      | 0      | 0      | 0         |
| lazysumdot  | 0      | 0      | 0      | 224       |
What is going on with the very small numbers? These are small allocations happening only once, and not once per dot product (\(2.9 \cdot 10^{-5}\) maps to \(96\) and \(4.8 \cdot 10^{-5}\) maps to \(160\)). Julia is compiled Just-In-Time, and I do make sure to run each function once before starting the measurement. So it is not JIT overhead. I don’t currently have an explanation. Curious!
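As an aside, this kind of per-call allocation figure can be checked by hand with Base's `@allocated` macro, after a warm-up call so that compilation itself is not counted:

```julia
x, y = randn(3), randn(3);

sumdot(x, y);              # warm-up: triggers JIT compilation
@allocated sumdot(x, y)    # bytes allocated by the call itself
```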
One thing to notice is how much the rather natural `sumdot` underwhelms.
Here is the run-time for vectors of length \(2000\).
And here are the memory allocation details, again in bytes per call.
|             | Array  | Range  | Tuple  | Generator |
|-------------|--------|--------|--------|-----------|
| builtin_dot | 0      | 0      | 0      | 0         |
| sumdot      | 16128  | 16128  | 144448 | 48384     |
| sumzip      | 0      | 0      | 0      | 0         |
| msmzip      | 0      | 0      | 0      | 0         |
| sumix       | 0      | 0      | 0      | 0.032     |
| msmix       | 0      | 0      | 0      | 0.032     |
| blas        | 0      | 0      | 6.4    | 0.032     |
| forea       | 128048 | 128176 | 64032  | 128176    |
| iter        | 0      | 0      | 0      | 0         |
| accixz      | 0      | 0      | 0      | 0.032     |
| acczip      | 0      | 0      | 0      | 0         |
| accix       | 0      | 0      | 0      | 0.032     |
| mapred      | 0      | 0      | 0      | 0         |
| lazysumdot  | 0      | 0      | 0      | 32256     |
This is how each method scales with the input length. We see that for large inputs, BLAS and `dot` are the winners, closely followed by `msmix` and `lazysumdot`.
The built-in function `dot`, from the LinearAlgebra standard library. It is also writable as the infix `·` (typed as `\cdot`, which renders as \(\cdot\)).
```julia
using LinearAlgebra  # provides dot and its infix alias ·

function builtin_dot(x, y)
    x · y;
end
```
The most straightforward implementation.
```julia
function sumdot(x, y)
    sum(x.*y);
end
```
Using iterator zipping.
```julia
function sumzip(x, y)
    sum(v * w for (v,w) in zip(x,y));
end
```
Using iterator zipping, combined with the ability of sum to apply a function.
```julia
function msmzip(x, y)
    sum(((v,w),) -> v*w, zip(x,y));
end
```
Iterate over the indices explicitly.
```julia
function sumix(x, y)
    sum(@inbounds(x[k] * y[k]) for k in eachindex(x));
end
```
Map the pairwise product function over indices.
```julia
function msmix(x, y)
    sum(k -> @inbounds(x[k] * y[k]), eachindex(x));
end
```
Defer to BLAS (the Basic Linear Algebra Subprograms).
```julia
function blas(x, y)
    x'y;
end
```
Manual dot product using two iterators.
```julia
function iter(x, y) # arbitrary iterables
    ix = iterate(x)
    iy = iterate(y)
    s = 0.
    while ix !== nothing && iy !== nothing
        (vx, xs), (vy, ys) = ix, iy
        s += vx * vy
        ix = iterate(x, xs)
        iy = iterate(y, ys)
    end
    if !(iy === nothing && ix === nothing)
        throw(DimensionMismatch("x and y are of different lengths!"))
    end
    return s
end
```
Manual dot product using zipped indices.
```julia
function accixz(x, y)
    lx = length(x)
    if lx != length(y)
        throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
    end
    s = 0.
    for (Ix, Iy) in zip(eachindex(x), eachindex(y))
        @inbounds s += x[Ix] * y[Iy]
    end
    s
end
```
Manual dot product using explicit indices.
```julia
function accix(x, y)
    lx = length(x)
    if lx != length(y)
        throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
    end
    s = 0.
    for k in 1:lx
        @inbounds s += x[k]*y[k]
    end
    s
end
```
Manual dot product using zipped collections.
```julia
function acczip(x, y)
    lx = length(x)
    if lx != length(y)
        throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
    end
    s = 0.
    for (Ix, Iy) in zip(x, y)
        s += Ix * Iy
    end
    s
end
```
Manual dot product using the foreach function.
```julia
function forea(x, y)
    s = 0.
    # s is reassigned inside the closure, so Julia boxes it (a known
    # performance caveat); this likely explains the allocations above
    foreach((v,w) -> s += v*w, x, y)
    s
end
```
Dot product using mapreduce.
```julia
function mapred(x, y)
    mapreduce(((a,b),)->a*b, +, zip(x, y))
end
```
Dot product using `sum`, with the LazyArrays `@~` non-materializing broadcast macro.
```julia
using LazyArrays  # provides the @~ macro

function lazysumdot(x, y)
    sum(@~ x.*y);
end
```
Many follow-up questions arise:

- What of mixed argument types, say the inner product between an Array and a Tuple?
- From a compiler perspective, why is this difficult? Why do these implementations, once all abstractions are instantiated, not compile to the same machine code?
- What of element-wise products between three vectors? We can write `sum(x .* y .* z)`, but now there is no built-in BLAS call to defer to. How can we make these fast? (A first attempt is sketched after this list.)
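For the last question, a natural first attempt, by analogy with `sumzip` above, is a zipped generator. This is an untested sketch, not one of the benchmarked implementations, and `sumzip3` is my own name:

```julia
# Three-way analogue of sumzip: no intermediate array of
# products is materialized before the reduction.
function sumzip3(x, y, z)
    sum(u * v * w for (u, v, w) in zip(x, y, z))
end
```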
One thing I have not investigated is whether all implementations have the same accuracy. I would believe so: all implementations accumulate the element-wise products in order. But one could try to maximize precision, for example by computing the dot product in a carefully chosen order (processing cancellations between large positive and negative terms before adding small terms). This would likely require sorting the entries, and that would stand out in the run-time.
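To illustrate what an accuracy-oriented variant could look like without sorting, here is a Kahan-compensated accumulation. It is a sketch of mine, not one of the benchmarked implementations, and `kahan_dot` is a name I made up:

```julia
# Accumulate the element-wise products with Kahan compensation:
# a running correction term c recovers low-order bits lost in s,
# at the price of extra arithmetic per element.
function kahan_dot(x, y)
    s = 0.0
    c = 0.0
    for (v, w) in zip(x, y)
        t = v * w - c
        u = s + t
        c = (u - s) - t   # the part of t that did not make it into s
        s = u
    end
    s
end
```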
What we may need is a lazy map, or a non-materialising broadcast operation. The Julia discussion on this topic is public, see for example
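As an aside, later Julia versions (1.6 onwards, so newer than the 1.5.3 used here) ship `Iterators.map`, which is exactly such a lazy map; a sketch, where `lazymapdot` is my own name:

```julia
# Iterators.map is lazy: no intermediate array of products is created
# before the reduction (requires Julia >= 1.6).
lazymapdot(x, y) = sum(Iterators.map(*, x, y))
```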