Automatic Differentiation

Dual Numbers - Forward Mode AD

Any reader of this document will be familiar with complex numbers:

\[z = x + iy\] where \(i^2 = -1\)

The concept of dual numbers, proposed by Clifford, in 1873, is closely related:

\[z = a + {\epsilon}b\] where \({\epsilon}^2 = 0, \epsilon \ne 0\)

It can be proven that for any analytic¹ function,

\[f(a + {\epsilon}b) = f(a) + bf'(a){\epsilon}\]

Therefore, passing a dual number into our generic function, with \(b = 1\), will return, with minimal overhead, both the function value and the derivative at \(a\). This is the basis of forward mode automatic differentiation.

It is important to note that this is not an approximation of the derivative by finite differencing. Finite differencing, will estimate the derivative with two function calls:

\[f'(a) \approxeq \frac{f(a + h) - f(a)}{h}\]

The problem with this approach is that using too large a value for \(h\), give inaccuracies in the estimation, but using too small a value of \(h\) give inaccuracies through floating point rounding errors. There is therefore an optimal value of \(h\), which will depend on the value of the function at the point of evaluation. This value is generally both unknown and unknowable. Automatic differentiation does not have this problem and also has fewer function calls, so is more efficient as well as more accurate.

We can also easily expand the method to multiple dimensions by simply expanding the idea of a dual number:

\[z = a + {\epsilon}_1v_1 + {\epsilon}_2v_2 + {\epsilon}_3v_3 + ...\]

where each \(v\) is a basis vector in which direction we wish to calculate a derivative.

\[f(z) = f(a) + f'(a)v_1{\epsilon}_1+ f'(a)v_1{\epsilon}_1+ f'(a)v_1{\epsilon}_1\]

This works perfectly well for a simple analytic function. To extend this to more complicated functions, we simply apply the rules for differentiation:

\[c = a + b => dc = da + db\] \[c = a - b => dc = da - db\] \[c = a * b => dc = b*da + a*db\] \[dc = a / b => dc = (da*b - a*db)/b^2\] \[c = sin(a) => dc = cos(a)*da\] \[\text{etc.}\]

This is implemented by providing Julia with operator overloading for the dual numbers, i.e. telling Julia how to add, subtract etc. with dual numbers.

As a simple example, we define a dual number type and tell Julia how to add and divide dual numbers, so we can test it with the Babylonian algorithm for calculating square roots. This is an example given by Prof Alan Edelman, one of the designers of the Julia language in a YouTube video, that is well worth watching.

# Import the Base functions we are going to extend
import Base: +, /, convert, promote_rule

# Define our Dual number type as a sub-type of Number
struct Dual <: Number
    f::Tuple{Float64, Float64}
end

# Tell Julia how to add and divide with Dual
+(x::Dual, y::Dual) = Dual(x.f .+ y.f)
/(x::Dual, y::Dual) = Dual((x.f[1]/y.f[1], (y.f[1]*x.f[2] - x.f[1]*y.f[2])/y.f[1]^2))

# Tell Julia how to deal with type conversions
convert(::Type{Dual}, x::Real) = Dual((float(x), 0.0))
promote_rule(::Type{Dual}, ::Type{<:Number}) = Dual

"""
The Babylonian algorithm for square roots
-----------------------------------------
Initial guess is 1, then for each iteration, take the average of the current estimate, a,
and x/a. If the current estimate is too large, x/a will be too small and vice versa. 
Taking the average of the two brings us closer with each step. Usually, ten iterations is
more than enough.
"""
function babylonian(x, n = 10)
    a = (x + 1)/2
    for i in 2:n
        a = (a + x/a)/2
    end
    return a
end

# Test our algorithm
babylonian(2) # 1.414213562373095
err = babylonian(2) - √2 # -2.220446049250313e-16

# Create a dual number, with the epsilon part set to 1, to "seed" the derivative for the first variable
d = Dual((2., 1.)) # Dual((2.0, 1.0))

# Run the same code we just used for real numbers
res = babylonian(d) # Dual((1.414213562373095, 0.35355339059327373))

# Check the results
res.f[1] - √2 # -2.220446049250313e-16
res.f[2] - 0.5/√2 # 0.0

Here, we have defined the conversion of a constant to a Dual by setting the derivative part to zero, since the derivative of a constant is zero.

When we call the function, however, we set the \(b\) (from \(z = a + {\epsilon}b\)) to one, since we want to calculate the derivative with regards to this variable. If this was a multivariate function, we would pick which derivative we wanted by which of the \(b\) values we set to one, with the others set to zero.

Reverse Mode AD

Forward mode AD is fairly easy to understand and implement, but it comes with a disadvantage. If we had \(n\) input variables, we would need \(n\) function calls to calculate the full gradient.

When fitting models to data, we usually do not care about the derivative in terms of the input, but rather in terms of the parameters of the model, of which there can be many. We can apply automatic differentiation here as well, by passing the parameters as dual numbers. For \(k\) parameters, we have a \(k+1\) dimensional “dual” number and \(k\) function calls.

When there are many parameters, it is possible to use a more efficient method than forward mode automatic differentiation.

Consider fitting a model. Our model has some outputs, \(w_i\), and some inputs, \(u_i\). We can use the chain rule to write the derivative of output \(w\) in terms of some yet-to-be-chosen variable, \(t\):

\[\frac{{\partial}w}{{\partial}t} = \sum_i{\frac{{\partial}w}{{\partial}u_i} \frac{{\partial}u_i}{{\partial}t}}\]

To get the derivative in terms of variable \(x\), we simply set \(t = x\).

The chain rule is symmetric - we can change the numerators and denominators around:

\[\frac{{\partial}s}{{\partial}u} = \sum_i{\frac{{\partial}w_i}{{\partial}u} \frac{{\partial}s}{{\partial}w_i}}\]

Where \(u\) is still an input, \(w\) an output, and \(s\) some yet-to-be-chosen variable. This time, however, we are calculating the derivative of \(s\) with regards to the input, \(u\). We have now reversed the positions: we are calculating the derivative of \(s\) with regards to input \(u\).

Let’s look at a simple example:

\[z = xy + sin(x)\]

We can define intermediate variables:

\[a = xy\] \[b = sin(x)\] \[z = a + b\]

Using the reverse chain-rule from above:

\[\frac{\partial s}{\partial b} = \frac{\partial z}{\partial b} \frac{\partial s}{\partial z} = \frac{\partial s}{\partial z}\]

\[\frac{\partial s}{\partial a} = \frac{\partial z}{\partial a} \frac{\partial s}{\partial z} = \frac{\partial s}{\partial z}\]

\[\frac{\partial s}{\partial y} = \frac{\partial a}{\partial y} \frac{\partial s}{\partial a} = x \frac{\partial s}{\partial a}\]

\[\frac{\partial s}{\partial x} = \frac{\partial a}{\partial x} \frac{\partial s}{\partial a} + \frac{\partial b}{\partial x} \frac{\partial s}{\partial b} = y \frac{\partial s}{\partial a} + cos(x) \frac{\partial s}{\partial b}\]

If we now substitute \(s = z\), we get:

\[\frac{\partial z}{\partial b} = \frac{\partial z}{\partial z} = 1\]

\[\frac{\partial z}{\partial a} = \frac{\partial z}{\partial z} = 1\]

\[\frac{\partial z}{\partial y} = x \frac{\partial z}{\partial a} = x\]

\[\frac{\partial z}{\partial x} = y \frac{\partial z}{\partial a} + cos(x) \frac{\partial z}{\partial b} = y + cos(x)\]

We have now managed to calculate both derivatives in a single pass! When training a neural network, that can have hundreds (small ANN) to billions of parameters (ChatGPT), being able to get the full gradient of the model output with regards to all of the model parameters in a single pass is a dramatic improvement in efficiency.

In general, when a model has many more outputs than inputs, forward mode automatic differentiation will be the more efficient method and when there are many more inputs than outputs, such as in machine learning, reverse mode will be more efficient. If your model is not large, or not called many times (such as when fitting parameters), you may not notice the difference.

Available packages

ForwardDiff.jl

One of the most used automatic differentiation packages in Julia is ForwardDiff.jl. It has a few limitations to its use:

It only works on Julia code - you cannot use it on C, Python or R code being called from Julia
Target functions must take only one input, which may be a vector, and return either a scalar or vector
The function must be written generically, i.e. no types must be specified. This is something that can quickly trip you up, when you for example, initialise a variable to 0.0 or 1.0, both of which are Float64. Rather use zero(x) and one(x), which will return a zero or one in the same type as x.
Input vectors must be a sub-type of AbstractArray, so some custom types may not be supported. Julia built-in arrays and StaticArrays are supported.

The package provides several functions, including:

ForwardDiff.derivative(f, x::Real): Calculate the derivative of f(x) in variable x at the current value of x

ForwardDiff.gradient(f, x::AbstractArray): Calculate the gradient of f(x) in each of the variables contained in vector x at the current value of x

ForwardDiff.jacobian(f, x::AbstractArray): Calculate the Jacobian matrix, where f(x) returns a vector of function values calculated from input vector x.

ForwardDiff.hessian(f, x::AbstractArray): Calculate the Hessian matrix, where f returns a scalar value, calculated from input vector x.

A simple example:

f(x) = 2x^2 - x
# f (generic function with 1 method)

df(x) = ForwardDiff.derivative(f, x)
# df (generic function with 1 method)

f(2)
# 6

df(2)
# 7

ForwardDiff does have the option to handle mutating functions, where the array result is returned in the first parameter. Several packages use ForwardDiff as an option integrated in solvers, including Roots.jl and Optim.jl, and some of the implicit differential equation solvers.

Zygote.jl

The Zygote package was developed as part of Flux.j - a deep learning package in Julia.

Zygote cannot handle mutating functions, try/catch statements for exception handling or calls to “foreign” code, like a C library. It works during the compilation phase, so generates very efficient code, using reverse mode automatic differentiation.

Using Zygote is fairly similar to using ForwardDiff.

gradient() - calculates the derivative of a function, with either scalar or vector inputs

jacobian() - calculates the Jacobian matrix for a system of equations, as well as hessian

hessian() - calculates the Hessian matrix for an equation with vector of inputs

There are also some additional utility functions. See the documentation for details.

Enzyme.jl

Enzyme is the future of automatic differentiation in Julia. It is a point of debate whether it is currently ready for general use, but if you use AD in your code, it will be good to keep an eye on this package.

ForwardDiff works by passing dual numbers to your code and using the ChainRules.jl package to supply the rules for differentiating more complex equations. Zygote goes a level deeper and generates the code for the derivatives during the compilation pass, using a process of source code transformation. The curious / adventurous can read the article for details.

Enzyme goes another level deeper and works on the LLVM code generated by the Julia compiler. The result is that there are few, if any, limitations on the code that can be differentiated and the resulting code is very, very fast. Enzyme itself is still under development. It is currently available in Julia and Rust. Enzyme can perform both forward and reverse mode differentiation.

There are several exported functions, but the ones most likely to be useful are:

gradient(): Calculates the gradient of am array-input function

gradient!(): Calculates the gradient of am array-input function and returns it by updating a passed array variable

jacobian(): calculates the Jacobian matrix of a set of equations

For all of these, the mode (forward or reverse) can be specified.

Watch this space!

Footnotes

Any function that is locally given by a convergent power series, such as a Taylor series↩︎