Dot product in Julia

Wouter M. Koolen

2021-02-08

Abstract

We implement and profile the dot product in Julia.

Introduction

I have been trying out Julia as a replacement for octave. One reason is that several of my collaborators (thanks Emilie, Aurélien and Rémy) are fans. I heard great things about its speed and its natural fit for the type of one-page programs one tends to try out research ideas.

In one of my projects, I found out by profiling that it was bottlenecked by the computation of the dot product between two vectors. In this project the vectors are of fixed and rather small length. Dimensions between \(3\) and \(10\) are common, and \(100\) is the exception. I had written the dot product between vectors \(x\) and \(y\) as follows

  sum(x .* y)

To me this looks natural. One sums the entry-wise products. How could one do this any faster?

Lessons

In this post we dig into accelerating the dot product, and learn some surprising lessons about Julia.

Method

I coded 13 different versions of the dot product. Each version is evaluated on lengths \(1\)\(2000\). I also evaluate each version on four types of inputs:

Many Julia commands return arrays, like randn(5,3). The main difference (in the context of this post) between a Range and a Generator is that the former allows indexing, i.e. (1:5)[3] while the latter only allows iteration (see an interesting read on trying to add indexing to generators).

Results

The implementations are listed below. Here are the results. The measurements are performed on my Debian laptop. I’m using Julia version 1.5.3, on a Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz.

Length 3

Here is the run-time for vectors of length \(3\).

Run time per dot product for vectors of length 3. Each bar is the average of 3334000 executions.

Computing the dot product should not require allocating memory. Yet some of the high runtime could be caused by it. Here are the total memory allocations in bytes caused by each implementation.

Number of bytes allocated per dot product computation for vectors of length \(3\). This is averaged over \(3334000\) repetitions.
Array Range Tuple Generator
builtin_dot 0 0 0 0
sumdot 112 112 0 336
sumzip 0 0 0 0
msmzip 0 0 0 0
sumix 0 0 0 4.8e-5
msmix 0 0 0 4.8e-5
blas 0 0 2.9e-5 4.8e-5
forea 240 368 128 368
iter 0 0 0 0
accixz 0 0 0 4.8e-5
acczip 0 0 0 0
accix 0 0 0 4.8e-5
mapred 0 0 0 0
lazysumdot 0 0 0 224

What is going on with the very small numbers? These are small allocations happening only once, and not once per dot product (\(2.9 \cdot 10^{-5}\) maps to \(96\) and \(4.8 \cdot 10^{-5}\) maps to \(160\)). Julia is compiled Just-In-Time, and I do make sure to run each function once before starting the measurement. So it is not JIT overhead. I don’t currently have an explanation. Curious!

One thing to notice is how much the rather natural sumdot underwhelms.

Length 2000

Here is the run-time for vectors of length \(2000\).

Run time per dot product for vectors of length 2000. Each bar is the average of 5001 executions.

And here are the memory allocation details.

Number of bytes allocated per dot product computation for vectors of length \(2000\). This is averaged over \(5001\) repetitions.
Array Range Tuple Generator
builtin_dot 0 0 0 0
sumdot 16128 16128 144448 48384
sumzip 0 0 0 0
msmzip 0 0 0 0
sumix 0 0 0 0.032
msmix 0 0 0 0.032
blas 0 0 6.4 0.032
forea 128048 128176 64032 128176
iter 0 0 0 0
accixz 0 0 0 0.032
acczip 0 0 0 0
accix 0 0 0 0.032
mapred 0 0 0 0
lazysumdot 0 0 0 32256

Scaling

This is how each method scales with the input length. We see that for large inputs, BLAS and dot are the winners, closely followed by msmix and lazysumdot.

Run time as a function of length for Array inputs.

Implementations

builtin_dot

The build-in function dot. Also writable as infix \cdot (which renders as \(\cdot\)).

[literate={·}{$\cdot$}1]
function builtin_dot(x, y)
    x · y;
end

sumdot

The most straightforward implementation.

function sumdot(x, y)
    sum(x.*y);
end

sumzip

Using iterator zipping

function sumzip(x, y)
    sum(v * w for (v,w) in zip(x,y));
end

msmzip

Using iterator zipping, combined with the ability of sum to apply a function.

function msmzip(x, y)
    sum(((v,w),) -> v*w, zip(x,y));
end

sumix

Iterate over the indices explicitly.

function sumix(x, y)
    sum(@inbounds(x[k] * y[k]) for k in eachindex(x));
end

msmix

Map the pairwise product function over indices.

function msmix(x, y)
    sum(k -> @inbounds(x[k] * y[k]), eachindex(x));
end

blas

Defer to BLAS (the Basic Linear Algebra Subsystem)

function blas(x, y)
    x'y;
end

iter

Manual dot product using two iterators.

function iter(x, y) # arbitrary iterables
    ix = iterate(x)
    iy = iterate(y)
    s = 0.
    while ix !== nothing && iy !== nothing
        (vx, xs), (vy, ys) = ix, iy
        s += vx * vy
        ix = iterate(x, xs)
        iy = iterate(y, ys)
    end
    if !(iy === nothing && ix === nothing)
        throw(DimensionMismatch("x and y are of different lengths!"))
    end
    return s
end

accixz

Manual dot product using zipped indices.

function accixz(x, y)
    lx = length(x)
    if lx != length(y)
        throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
    end
    s = 0.
    for (Ix, Iy) in zip(eachindex(x), eachindex(y))
        @inbounds s += x[Ix] * y[Iy]
    end
    s
end

accix

Manual dot product using explicit indices.

function accix(x, y)
    lx = length(x)
    if lx != length(y)
        throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
    end
    s = 0.
    for k in 1:lx
        @inbounds s += x[k]*y[k]
    end
    s
end

acczip

Manual dot product using zipped collections.

function acczip(x, y)
    lx = length(x)
    if lx != length(y)
        throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
    end
    s = 0.
    for (Ix, Iy) in zip(x, y)
        s += Ix * Iy
    end
    s
end

forea

Manual dot product using the foreach function.

function forea(x, y)
    s = 0.
    foreach((v,w) -> s += v*w, x, y)
    s
end

mapred

Dot product using mapreduce.

function mapred(x, y)
    mapreduce(((a,b),)->a*b, +, zip(x, y))
end

lazysumdot

Dot product using sum, with the LazyArrays “@\(\sim\)” non-materializing broadcast macro.

function lazysumdot(x, y)
    sum(@~ x.*y);
end

Thoughts

Many follow-up questions arise.

What we may need is a lazy map, or a non-materialising broadcast operation. The julia discussion on this topic is public, see for example