Do you remember your undergraduate calculus course? Frankly speaking, I don’t. But I do remember a big blackboard completely covered in chalk scribbles.
That was a nightmare! Hopefully, those
times are gone. I mean, why would we care about derivatives now?
Here at ELEKS, we do care about derivatives. We create and deliver precise models of various dynamic processes. The application field doesn’t matter: it may be physics, finance, or biology. What really matters is that once you need to simulate something, you will most likely need to compute derivatives. And you need to compute them fast and with great precision. Thus, we created ADEL – a C++ template-based library that fits our needs. But let’s start with some background information.
These requirements rule out solutions such as symbolic differentiation and finite-difference schemes. The first approach generates exact formulae for derivatives of all orders and directions; it is very memory- and time-consuming and produces enormous expressions. The second computes numerical approximations of the derivative values, which suffer from round-off, cancellation and discretization errors.
Fortunately, there is a middle ground – automatic
differentiation (AD). It computes derivatives essentially by applying the chain rule. It generates evaluations (but not formulae) of the derivatives. As a result, there are none of the drawbacks (accuracy loss) of numerical differentiation, where the step size needs to be carefully chosen to avoid errors. All intermediate expressions are evaluated as soon as possible; this saves memory and removes the need for later simplification. Moreover, AD is not as complicated as symbolic representation.
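As a small worked example (added here for illustration), take f(x) = sin(x^2). Forward-mode AD evaluates the function and its derivative in a single sweep by chaining the elementary rules:

    w1 = x,        w1' = 1
    w2 = w1^2,     w2' = 2*w1*w1'
    w3 = sin(w2),  w3' = cos(w2)*w2' = 2*x*cos(x^2)

No symbolic expression for f'(x) is ever built; only the numerical values of the intermediates and their derivatives are carried along.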
There are two modes of automatic differentiation – the
forward mode and the reverse mode – both
use different directions of the chain rule to propagate the derivative information. There are
two main ways to implement automatic differentiation:
Operator overloading is straightforward and easy to understand – one overloads the operators not only to compute function values but also to compute derivative information or to build up the computational graph (a minimal sketch follows below).
Source transformation works like a compiler: it reads in the code that computes the function, analyzes it and emits new source code that computes the derivatives. This approach normally generates more optimized programs but requires a lot of implementation effort.
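To make the operator-overloading idea concrete, here is a minimal sketch (an illustration only, not ADEL’s actual types or API) of forward-mode AD with a “dual number” that carries a value and a derivative; every overloaded operation applies the chain rule to both parts at once:

    #include <cmath>
    #include <iostream>

    struct Dual {
        double val;  // function value
        double der;  // derivative with respect to the chosen input
    };

    Dual operator*(Dual a, Dual b) {   // product rule
        return { a.val * b.val, a.der * b.val + a.val * b.der };
    }

    Dual sin(Dual a) {                 // chain rule through sin
        return { std::sin(a.val), std::cos(a.val) * a.der };
    }

    int main() {
        Dual x{ 2.0, 1.0 };            // seed dx/dx = 1
        Dual f = sin(x * x);           // the same f(x) = sin(x^2) as above
        std::cout << f.val << " " << f.der << std::endl;  // f.der == 2*x*cos(x^2)
    }

A reverse-mode implementation overloads the same operators but records enough information to replay the chain rule backwards from the output. For contrast, here is a rough idea of what a source-transformation tool might emit for a tiny function (the generated names below are made up; real tools produce far more elaborate code):

    #include <cmath>

    double f(double x) { return x * std::sin(x); }

    // hypothetical generated routine: value and derivative computed together
    double f_d(double x, double* dfdx) {
        double s = std::sin(x);
        *dfdx = s + x * std::cos(x);   // product rule + chain rule, emitted as plain code
        return x * s;
    }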
ADEL contains operator-overloading implementations of both the forward and the reverse mode. It is an open-source project; all the required materials can be found at: https://github.com/eleks/ADEL.
The main feature of ADEL’s forward mode is careful hand-tuned optimization. The template-based implementation with loop unrolling compiles into very efficient sequential code. This implementation suits the GPU computing architecture very well, and the support for CUDA makes the solution truly unique.
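As a rough illustration of why the compile-time approach pays off (again, a sketch under assumptions, not ADEL’s real implementation): when the number of derivative directions N is a template parameter, the per-direction loops have a trip count known at compile time, so the compiler can fully unroll them, and the same property makes the code map well onto CUDA threads.

    template<int N>
    struct FwdVar {
        double val;     // function value
        double der[N];  // derivatives in N independent directions

        FwdVar operator+(const FwdVar& o) const {
            FwdVar r;
            r.val = val + o.val;
            for (int i = 0; i < N; ++i)   // trip count is a compile-time constant
                r.der[i] = der[i] + o.der[i];
            return r;
        }
    };
    // e.g. FwdVar<6> would carry the value plus all six partials of the test function below at once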
The reverse mode of the ADEL library is special
in the way it stores the computational graph. It uses the stack of the
application for this purpose. Each overloaded operation creates a data structure that is placed on top of the stack; the lifetime of these structures is just long enough to compute the derivative data. This makes the algorithm extremely efficient, since most of the operations can be inlined.
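The following toy example (an illustration only – the names and structure are made up and far simpler than ADEL’s) shows the flavour of the idea: the overloaded operator returns a small node by value, so the “graph” for the expression lives in stack temporaries, and the reverse sweep runs through them before they go out of scope.

    #include <iostream>

    struct Var {
        double val;   // value of an independent variable
        double adj;   // accumulated d(output)/d(this variable)
    };

    struct MulNode {  // created by the overloaded operator, lives on the call stack
        Var& a;
        Var& b;
        double val;
        void backward(double adjoint) {  // reverse sweep for this single node
            a.adj += adjoint * b.val;    // d(a*b)/da = b
            b.adj += adjoint * a.val;    // d(a*b)/db = a
        }
    };

    MulNode operator*(Var& a, Var& b) { return { a, b, a.val * b.val }; }

    int main() {
        Var x{ 3.0, 0.0 }, y{ 4.0, 0.0 };
        MulNode f = x * y;     // the "graph" is just this stack temporary
        f.backward(1.0);       // seed df/df = 1 and push the adjoint down
        std::cout << x.adj << " " << y.adj << std::endl;   // prints 4 and 3
    }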
We compared our implementation with some of the already existing solutions
in order to see how it stands up against the others. We implemented Newton’s
method to test the accuracy of our tools. For performance testing, a random
expression generator was used. It is capable of generating some monstrous
things (your professor would never dream of such things). Here is an example of
a test function:
template<typename ADType>
ADType TestFunction(const ADType (&x)[6]) {
    ADType y[3];   // temporary (dependent) variables
    y[0] = x[1]; y[1] = -2.730; y[2] = 4.555;
    for (int i = 0; i < 1000; i++) {
        y[0] = x[1]+x[5]+4.624/x[0]+(x[3]-y[0])/x[4]+(atan(-y[2])-x[2])*y[1];
        y[1] = (sin(y[1])-y[2])/x[2]/x[4]*(x[3]+log(x[0])/(y[0]-x[5]))+x[1];
        y[2] = atan(x[5])*(-y[2])/x[3]/(x[4]/(x[2]/x[1]+acos(y[1])))*y[0];
    }
    return 2.658/log(x[0])*(log(y[2])/x[4])*(x[1]+x[5])*(y[0]/y[1])*x[3];
}
The main characteristics of a test case are the number of independent (x) and dependent (y) variables as well as the total number of functions/operators. The given task was to compute the gradient and Hessian of the return value with respect to all independent variables. In the example above, we have 6 independent and 3 temporary (dependent) variables that produce a single output.
We tested the following third-party AD implementations:
FADBAD (http://www.fadbad.com/fadbad.html) – a lightweight, user-friendly library. We used a combined mode: the gradient
was computed via the reverse mode, while the Hessian was computed via the forward mode (see the note below).
ADOL-C (http://www.coin-or.org/projects/ADOL-C.xml) – a huge math package that is capable of computing literally everything.
The reverse mode was used in the experiments.
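A note on the combined (nested) mode, added here as background rather than as a description of FADBAD’s internals: reverse mode delivers the whole gradient in one sweep, and running forward mode over that gradient along a direction v differentiates it once more,

    d/de [ grad f(x + e*v) ] at e = 0  =  H(x) * v,

so one forward sweep per coordinate direction recovers the full Hessian column by column.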
There were four test cases of differing sizes with 5/10/20/30 input and
1/5/10/15 intermediate variables. The results were normalized in such a way that the slowest execution was scored 100 time units, so lower is better.
Looking at the results, one can see that we did not create the ultimate AD
tool - not yet. In the smaller tests both ADEL modes greatly outperform their rivals.
When the number of variables increases, the performance of ADEL [Forward]
degrades because of the large memory overhead per variable. ADEL [Reverse]
shows the best average performance; FADBAD comes second; ADOL-C received no
prize at all.
Automatic differentiation is a very promising but extremely underused tool; much of the community is still unaware of it. This article is just a small but solid step into the bright future of AD tools. As a final word, we encourage you to download the source code and experiment with ADEL on your own. If there is something we can improve, feel free to contact us.
Can your stack-based approach to AD handle loops in the functions? We do a lot of multivariate coding for probability models in Stan, for which we wrote our own forward and reverse-mode AD:
http://mc-stan.org/
We focused on efficient vectorization and arena-based memory management, as well as unfolding derivatives for matrix and linear algebra ops and vectorized probability functions.
Hi,
Yes, we can handle loops: only a single iteration of the loop is stored in memory, not the whole computational graph. This is not the canonical reverse mode – we call it hybrid – but it has proven to be viable.
I didn't mention it in the article, but the tests were performed on the Windows platform. That's why we didn't include more of the existing AD tools in the comparison.
If you wish, we can run some tests of yours (we'll go *nix), but not these meaningless random tests.