Tutorial
There are two ways to compute gradients with Mooncake.jl:
- through the standardised DifferentiationInterface.jl API
- through the native Mooncake.jl API
We recommend the former to start with, especially if you want to experiment with other automatic differentiation packages.
import DifferentiationInterface as DI
import Mooncake
DifferentiationInterface.jl API
DifferentiationInterface.jl (or DI for short) provides a common entry point for every automatic differentiation package in Julia. To specify that you want to use Mooncake.jl, just create the right "backend" object (with an optional Mooncake.Config):
backend = DI.AutoMooncake(; config=nothing)
ADTypes.AutoMooncake()
This object is actually defined by a third package called ADTypes.jl, but re-exported by DI.
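The config keyword lets you customise how Mooncake.jl derives rules. As a hedged sketch (assuming your Mooncake version exposes the debug_mode option of Mooncake.Config), you could enable extra correctness checks like this:
# Assumed keyword: `debug_mode` turns on additional (slower) sanity checks while differentiating.
debug_backend = DI.AutoMooncake(; config=Mooncake.Config(; debug_mode=true))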
Single argument
Suppose you want to differentiate the following function
f(x) = sum(abs2, x)
f (generic function with 1 method)
on the following input
x = float.(1:3)
3-element Vector{Float64}:
1.0
2.0
3.0
The naive way is to simply call DI.gradient:
DI.gradient(f, backend, x) # slow, do not do this
3-element Vector{Float64}:
2.0
4.0
6.0
This returns the correct gradient, but it is very slow because it includes the time taken by Mooncake.jl to compute a differentiation rule for f (see Mooncake.jl's Rule System). If you anticipate you will need more than one gradient, it is better to call DI.prepare_gradient on a typical (e.g. random) input first:
typical_x = rand(3)
prep = DI.prepare_gradient(f, backend, typical_x)
DifferentiationInterfaceMooncakeExt.MooncakeGradientPrep{Nothing, Mooncake.Cache{Mooncake.DerivedRule{Tuple{typeof(Main.f), Vector{Float64}}, Tuple{Mooncake.CoDual{typeof(Main.f), Mooncake.NoFData}, Mooncake.CoDual{Vector{Float64}, Vector{Float64}}}, Mooncake.CoDual{Float64, Mooncake.NoFData}, Tuple{Float64}, Tuple{Mooncake.NoRData, Mooncake.NoRData}, false, Val{2}}, Nothing, Tuple{Mooncake.NoTangent, Vector{Float64}}}}(Val{Nothing}(), Mooncake.Cache{Mooncake.DerivedRule{Tuple{typeof(Main.f), Vector{Float64}}, Tuple{Mooncake.CoDual{typeof(Main.f), Mooncake.NoFData}, Mooncake.CoDual{Vector{Float64}, Vector{Float64}}}, Mooncake.CoDual{Float64, Mooncake.NoFData}, Tuple{Float64}, Tuple{Mooncake.NoRData, Mooncake.NoRData}, false, Val{2}}, Nothing, Tuple{Mooncake.NoTangent, Vector{Float64}}}(Mooncake.DerivedRule{Tuple{typeof(Main.f), Vector{Float64}}, Tuple{Mooncake.CoDual{typeof(Main.f), Mooncake.NoFData}, Mooncake.CoDual{Vector{Float64}, Vector{Float64}}}, Mooncake.CoDual{Float64, Mooncake.NoFData}, Tuple{Float64}, Tuple{Mooncake.NoRData, Mooncake.NoRData}, false, Val{2}}(MistyClosure (::Mooncake.CoDual{typeof(Main.f), Mooncake.NoFData}, ::Mooncake.CoDual{Vector{Float64}, Vector{Float64}})::Mooncake.CoDual{Float64, Mooncake.NoFData}->◌, Base.RefValue{MistyClosures.MistyClosure{Core.OpaqueClosure{Tuple{Float64}, Tuple{Mooncake.NoRData, Mooncake.NoRData}}}}(MistyClosure (::Float64)::Tuple{Mooncake.NoRData, Mooncake.NoRData}->◌), Val{2}()), nothing, (Mooncake.NoTangent(), [0.0, 0.0, 0.0])))
The typical input should have the same size and type as the actual inputs we will provide later on. As for the contents of the preparation result, they do not matter. What matters is that it captures everything you need for DI.gradient to be fast:
DI.gradient(f, prep, backend, x) # fast
3-element Vector{Float64}:
2.0
4.0
6.0
For optimal speed, you can provide storage space for the gradient and call DI.gradient! instead:
grad = similar(x)
DI.gradient!(f, grad, prep, backend, x) # very fast
3-element Vector{Float64}:
2.0
4.0
6.0
If you also need the value of the function, check out DI.value_and_gradient or DI.value_and_gradient!:
DI.value_and_gradient(f, prep, backend, x)
(14.0, [2.0, 4.0, 6.0])
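Here is a quick sketch of the in-place variant, which writes into the grad buffer allocated above and returns both the value and the gradient:
DI.value_and_gradient!(f, grad, prep, backend, x)  # returns (value, grad), overwriting grad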
Multiple arguments
What should you do if your function takes more than one input argument? Well, DI can still handle it, assuming that you only want the derivative with respect to one of them (the first one, by convention). For instance, consider the function
g(x, a, b) = a * f(x) + b
g (generic function with 1 method)
You can easily compute the gradient with respect to x, while keeping a and b fixed. To do that, just wrap these two arguments inside DI.Constant, like so:
typical_a, typical_b = 1.0, 1.0
prep = DI.prepare_gradient(g, backend, typical_x, DI.Constant(typical_a), DI.Constant(typical_b))
a, b = 42.0, 3.14
DI.value_and_gradient(g, prep, backend, x, DI.Constant(a), DI.Constant(b))
(591.14, [84.0, 168.0, 252.0])
Note that this works even when you change the value of a or b (those are not baked into the preparation result).
If one of your additional arguments behaves like a scratch space in memory (instead of a meaningful constant), you can use DI.Cache instead.
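Here is a minimal sketch of DI.Cache, assuming your versions of DI and Mooncake support cache contexts; g_cache is a hypothetical function that overwrites a scratch buffer c on every call:
g_cache(x, c) = (c .= abs2.(x); sum(c))  # c is scratch space, not a meaningful input
c = similar(typical_x)
prep_cache = DI.prepare_gradient(g_cache, backend, typical_x, DI.Cache(c))
DI.gradient(g_cache, prep_cache, backend, x, DI.Cache(c))  # gradient with respect to x only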
Now what if you care about the derivatives with respect to every argument? You can always go back to the single-argument case by putting everything inside a tuple:
g_tup(xab) = xab[2] * f(xab[1]) + xab[3]
prep = DI.prepare_gradient(g_tup, backend, (typical_x, typical_a, typical_b))
DI.value_and_gradient(g_tup, prep, backend, (x, a, b))
(591.14, ([84.0, 168.0, 252.0], 14.0, 1.0))
You can also use the native API of Mooncake.jl, discussed below.
Beyond gradients
Going through DI allows you to compute other kinds of derivatives, like (reverse-mode) Jacobian matrices. The syntax is very similar:
h(x) = cos.(x) .* sin.(reverse(x))
prep = DI.prepare_jacobian(h, backend, x)
DI.jacobian(h, prep, backend, x)
3×3 Matrix{Float64}:
-0.118748 0.0 -0.534895
0.0 -0.653644 0.0
-0.534895 0.0 -0.118748
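As with gradients, here is a sketch of the in-place variant, which fills a preallocated output matrix:
J = similar(x, length(h(x)), length(x))
DI.jacobian!(h, J, prep, backend, x)  # writes the Jacobian of h at x into J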
Mooncake.jl API
Mooncake.jl Functions
Mooncake.jl provides the following core differentiation functions:
- Forward mode: Mooncake.value_and_derivative!! - computes function value and the Frechet derivative
- Reverse mode: Mooncake.value_and_gradient!! - computes function value and gradient (when output is scalar)
- Reverse mode: Mooncake.value_and_pullback!! - computes function value and pullback (general case)
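As a rough sketch of the reverse-mode entry points (assuming your Mooncake version provides the Mooncake.prepare_gradient_cache helper), the gradient of f from the DI examples above can be computed natively like this:
cache = Mooncake.prepare_gradient_cache(f, x)
Mooncake.value_and_gradient!!(cache, f, x)  # value plus gradients w.r.t. every argument, including f itself
# The native API differentiates with respect to all arguments, which also covers the multi-argument g above:
cache_g = Mooncake.prepare_gradient_cache(g, x, a, b)
Mooncake.value_and_gradient!!(cache_g, g, x, a, b)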
Terminology Comparison with DifferentiationInterface.jl
Mooncake.jl discusses Frechet derivatives and their adjoints, as described in detail in Algorithmic Differentiation. This differs from the conventions used by DifferentiationInterface.jl and some other AD packages.
General cases:
- Frechet derivative: In forward mode, Mooncake computes the Frechet derivative D f[x], which maps tangent vectors to tangent vectors. This corresponds to what DifferentiationInterface refers to as a "pushforward", and is implemented in Mooncake.value_and_derivative!!.
- Adjoint of derivative and pullback: In reverse mode, Mooncake computes the adjoint D f[x]* of the Frechet derivative, which maps cotangent vectors backwards through the computation. This corresponds to what DifferentiationInterface calls a "pullback", and is implemented in Mooncake.value_and_pullback!!.
Special cases (scalar input/output):
- Derivative: When the input is scalar, the Frechet derivative f'(x) = D f[x](v) with v = 1 gives the ordinary derivative. This corresponds to DI.derivative, while Mooncake lacks an equivalent API and handles this as a special case of Mooncake.value_and_derivative!!.
- Gradient: When the output is scalar, the adjoint of the derivative applied to 1 gives the gradient ∇f. This corresponds to DI.gradient and is implemented in Mooncake.value_and_gradient!!.
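To make the gradient/pullback correspondence concrete, here is a hedged sketch (assuming Mooncake.prepare_pullback_cache is available in your version): seeding the pullback of the scalar-valued f with the cotangent 1.0 reproduces the gradient computed earlier.
cache = Mooncake.prepare_pullback_cache(f, x)
Mooncake.value_and_pullback!!(cache, 1.0, f, x)  # should match value_and_gradient!! for scalar outputs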
For a detailed mathematical treatment of these concepts, see Algorithmic Differentiation, particularly the sections on Derivatives.