2021-02-08
I have been trying out Julia as a replacement for Octave. One reason is that several of my collaborators (thanks Emilie, Aurélien and Rémy) are fans. I had heard great things about its speed and its natural fit for the kind of one-page programs one writes to try out research ideas.
In one of my projects, profiling revealed a bottleneck in the computation of the dot product between two vectors. In this project the vectors have fixed and rather small lengths: dimensions between \(3\) and \(10\) are common, and \(100\) is the exception. I had written the dot product between vectors \(x\) and \(y\) as follows:
```julia
sum(x .* y)
```
To me this looks natural. One sums the entry-wise products. How could one do this any faster?
In this post we dig into accelerating the dot product, and learn some surprising lessons about Julia:

- The natural implementation is not the fastest.
- Finding the fastest implementation is not straightforward.
- The run-times of different implementations of the dot product are strikingly diverse.
I coded 14 different versions of the dot product. Each version is evaluated on lengths \(1\)–\(2000\). I also evaluate each version on four types of inputs:
- Array: `[1,2,3]`
- Tuple: `(1,2,3)`
- Range: `1:3`
- Generator: `(x^2 for x in 1:3)`
Many Julia commands return arrays, like `randn(5,3)`. The main difference (in the context of this post) between a Range and a Generator is that the former allows indexing, e.g. `(1:5)[3]`, while the latter only allows iteration (see an interesting read on trying to add indexing to generators).
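To make these four kinds concrete, here is a small illustration (the variable names are mine, for illustration only) of how each is built and of the indexing difference:

```julia
a = [1, 2, 3]             # Array: values stored in memory, mutable, indexable
t = (1, 2, 3)             # Tuple: immutable, length known to the compiler
r = 1:3                   # Range: indexable, but only the endpoints are stored
g = (x^2 for x in 1:3)    # Generator: values are computed lazily, on demand

r[3]          # == 3: ranges support indexing
# g[3]        # MethodError: generators only support iteration
collect(g)    # == [1, 4, 9]: iterating the generator computes its values
```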
The implementations are listed below, followed by the results. The measurements are performed on my Debian laptop, running Julia version 1.5.3 on an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz.
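My exact measurement harness is not reproduced here; a minimal version, assuming the BenchmarkTools package, could look as follows, applied to each of the implementations listed below:

```julia
using BenchmarkTools

x = randn(3);
y = randn(3);

# Interpolating the arguments with $ keeps the global-variable lookup
# out of the measured time.
@btime sumdot($x, $y);
@btime builtin_dot($x, $y);
```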
Here is the run-time for vectors of length \(3\).
Computing the dot product should not require allocating memory, yet allocation could explain some of the high run-times. Here are the memory allocations in bytes, per dot product call, caused by each implementation.
|             | Array  | Range  | Tuple  | Generator |
|-------------|--------|--------|--------|-----------|
| builtin_dot | 0      | 0      | 0      | 0         |
| sumdot      | 112    | 112    | 0      | 336       |
| sumzip      | 0      | 0      | 0      | 0         |
| msmzip      | 0      | 0      | 0      | 0         |
| sumix       | 0      | 0      | 0      | 4.8e-5    |
| msmix       | 0      | 0      | 0      | 4.8e-5    |
| blas        | 0      | 0      | 2.9e-5 | 4.8e-5    |
| forea       | 240    | 368    | 128    | 368       |
| iter        | 0      | 0      | 0      | 0         |
| accixz      | 0      | 0      | 0      | 4.8e-5    |
| acczip      | 0      | 0      | 0      | 0         |
| accix       | 0      | 0      | 0      | 4.8e-5    |
| mapred      | 0      | 0      | 0      | 0         |
| lazysumdot  | 0      | 0      | 0      | 224       |
What is going on with the very small numbers? These are small allocations happening only once, and not once per dot product (\(2.9 \cdot 10^{-5}\) maps to \(96\) and \(4.8 \cdot 10^{-5}\) maps to \(160\)). Julia is compiled Just-In-Time, and I do make sure to run each function once before starting the measurement. So it is not JIT overhead. I don’t currently have an explanation. Curious!
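As an aside, this kind of per-call allocation figure can be checked by hand with Base's `@allocated` macro, after a warm-up call so that compilation itself is not counted:

```julia
x, y = randn(3), randn(3);

sumdot(x, y);              # warm-up: triggers JIT compilation
@allocated sumdot(x, y)    # bytes allocated by the call itself
```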
One thing to notice is how much the rather natural `sumdot` underwhelms.
Here is the run-time for vectors of length \(2000\).
And here are the memory allocation details, again in bytes per call.
|             | Array  | Range  | Tuple  | Generator |
|-------------|--------|--------|--------|-----------|
| builtin_dot | 0      | 0      | 0      | 0         |
| sumdot      | 16128  | 16128  | 144448 | 48384     |
| sumzip      | 0      | 0      | 0      | 0         |
| msmzip      | 0      | 0      | 0      | 0         |
| sumix       | 0      | 0      | 0      | 0.032     |
| msmix       | 0      | 0      | 0      | 0.032     |
| blas        | 0      | 0      | 6.4    | 0.032     |
| forea       | 128048 | 128176 | 64032  | 128176    |
| iter        | 0      | 0      | 0      | 0         |
| accixz      | 0      | 0      | 0      | 0.032     |
| acczip      | 0      | 0      | 0      | 0         |
| accix       | 0      | 0      | 0      | 0.032     |
| mapred      | 0      | 0      | 0      | 0         |
| lazysumdot  | 0      | 0      | 0      | 32256     |
This is how each method scales with the input length. We see that for large inputs, BLAS and `dot` are the winners, closely followed by `msmix` and `lazysumdot`.
The built-in function `dot`, from the LinearAlgebra standard library. It is also writable as the infix `·` (typed as `\cdot`, which renders as \(\cdot\)).
```julia
using LinearAlgebra  # provides dot and its infix alias ·

function builtin_dot(x, y)
    x · y;
end
```
The most straightforward implementation.
```julia
function sumdot(x, y)
    sum(x.*y);
end
```
Using iterator zipping.
```julia
function sumzip(x, y)
    sum(v * w for (v,w) in zip(x,y));
end
```
Using iterator zipping, combined with the ability of sum to apply a function.
```julia
function msmzip(x, y)
    sum(((v,w),) -> v*w, zip(x,y));
end
```
Iterate over the indices explicitly.
```julia
function sumix(x, y)
    sum(@inbounds(x[k] * y[k]) for k in eachindex(x));
end
```
Map the pairwise product function over indices.
```julia
function msmix(x, y)
    sum(k -> @inbounds(x[k] * y[k]), eachindex(x));
end
```
Defer to BLAS (the Basic Linear Algebra Subprograms).
```julia
function blas(x, y)
    x'y;
end
```
Manual dot product using two iterators.
```julia
function iter(x, y) # arbitrary iterables
    ix = iterate(x)
    iy = iterate(y)
    s = 0.
    while ix !== nothing && iy !== nothing
        (vx, xs), (vy, ys) = ix, iy
        s += vx * vy
        ix = iterate(x, xs)
        iy = iterate(y, ys)
    end
    if !(iy === nothing && ix === nothing)
        throw(DimensionMismatch("x and y are of different lengths!"))
    end
    return s
end
```
Manual dot product using zipped indices.
```julia
function accixz(x, y)
    lx = length(x)
    if lx != length(y)
        throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
    end
    s = 0.
    for (Ix, Iy) in zip(eachindex(x), eachindex(y))
        @inbounds s += x[Ix] * y[Iy]
    end
    s
end
```
Manual dot product using explicit indices.
```julia
function accix(x, y)
    lx = length(x)
    if lx != length(y)
        throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
    end
    s = 0.
    for k in 1:lx
        @inbounds s += x[k]*y[k]
    end
    s
end
```
Manual dot product using zipped collections.
```julia
function acczip(x, y)
    lx = length(x)
    if lx != length(y)
        throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
    end
    s = 0.
    for (Ix, Iy) in zip(x, y)
        s += Ix * Iy
    end
    s
end
```
Manual dot product using the foreach function.
```julia
function forea(x, y)
    s = 0.
    # s is reassigned inside the closure, so Julia boxes it (a known
    # performance caveat); this likely explains the allocations above
    foreach((v,w) -> s += v*w, x, y)
    s
end
```
Dot product using mapreduce.
```julia
function mapred(x, y)
    mapreduce(((a,b),)->a*b, +, zip(x, y))
end
```
Dot product using `sum`, with the LazyArrays `@~` non-materializing broadcast macro.
```julia
using LazyArrays  # provides the @~ macro

function lazysumdot(x, y)
    sum(@~ x.*y);
end
```
Many follow-up questions arise:

- What of mixed argument types, say the inner product between an Array and a Tuple?
- From a compiler perspective, why is this difficult? Why do these implementations, once all abstractions are instantiated, not compile to the same machine code?
- What of element-wise products between three vectors? We can write `sum(x .* y .* z)`, but now there is no built-in BLAS call to defer to. How can we make these fast? (A first attempt is sketched after this list.)
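For the last question, a natural first attempt, by analogy with `sumzip` above, is a zipped generator. This is an untested sketch, not one of the benchmarked implementations, and `sumzip3` is my own name:

```julia
# Three-way analogue of sumzip: no intermediate array of
# products is materialized before the reduction.
function sumzip3(x, y, z)
    sum(u * v * w for (u, v, w) in zip(x, y, z))
end
```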
One thing I have not investigated is whether all implementations have the same accuracy. I would believe so: all implementations accumulate the element-wise products in order. But one could try to maximize precision, for example by computing the dot product in a carefully chosen order (processing cancellations between large positive and negative terms before adding small terms). This would likely require sorting the entries, and that would stand out in the run-time.
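To illustrate what an accuracy-oriented variant could look like without sorting, here is a Kahan-compensated accumulation. It is a sketch of mine, not one of the benchmarked implementations, and `kahan_dot` is a name I made up:

```julia
# Accumulate the element-wise products with Kahan compensation:
# a running correction term c recovers low-order bits lost in s,
# at the price of extra arithmetic per element.
function kahan_dot(x, y)
    s = 0.0
    c = 0.0
    for (v, w) in zip(x, y)
        t = v * w - c
        u = s + t
        c = (u - s) - t   # the part of t that did not make it into s
        s = u
    end
    s
end
```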
What we may need is a lazy map, or a non-materialising broadcast operation. The Julia discussion on this topic is public, see for example
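As an aside, later Julia versions (1.6 onwards, so newer than the 1.5.3 used here) ship `Iterators.map`, which is exactly such a lazy map; a sketch, where `lazymapdot` is my own name:

```julia
# Iterators.map is lazy: no intermediate array of products is created
# before the reduction (requires Julia >= 1.6).
lazymapdot(x, y) = sum(Iterators.map(*, x, y))
```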