Tutorial
There are two ways to compute gradients with Mooncake.jl:
- through the standardised DifferentiationInterface.jl API
- through the native Mooncake.jl API
We recommend the former to start with, especially if you want to experiment with other automatic differentiation packages.
import DifferentiationInterface as DI
import Mooncake
DifferentiationInterface.jl API
DifferentiationInterface.jl (or DI for short) provides a common entry point for every automatic differentiation package in Julia. To specify that you want to use Mooncake.jl, just create the right "backend" object (with an optional Mooncake.Config):
backend = DI.AutoMooncake(; config=nothing)
ADTypes.AutoMooncake()
This object is actually defined by a third package called ADTypes.jl, but re-exported by DI.
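The config keyword lets you customise how Mooncake.jl derives rules. As a hedged sketch (assuming your Mooncake version exposes the debug_mode option of Mooncake.Config), you could enable extra correctness checks like this:
# Assumed keyword: `debug_mode` turns on additional (slower) sanity checks while differentiating.
debug_backend = DI.AutoMooncake(; config=Mooncake.Config(; debug_mode=true))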
Single argument
Suppose you want to differentiate the following function
f(x) = sum(abs2, x)
f (generic function with 1 method)
on the following input
x = float.(1:3)
3-element Vector{Float64}:
1.0
2.0
3.0
The naive way is to simply call DI.gradient:
DI.gradient(f, backend, x) # slow, do not do this
3-element Vector{Float64}:
2.0
4.0
6.0
This returns the correct gradient, but it is very slow because it includes the time taken by Mooncake.jl to compute a differentiation rule for f (see Mooncake.jl's Rule System). If you anticipate you will need more than one gradient, it is better to call DI.prepare_gradient on a typical (e.g. random) input first:
typical_x = rand(3)
prep = DI.prepare_gradient(f, backend, typical_x)
DifferentiationInterfaceMooncakeExt.MooncakeGradientPrep{Nothing, Mooncake.Cache{Mooncake.DerivedRule{Tuple{typeof(Main.f), Vector{Float64}}, Tuple{Mooncake.CoDual{typeof(Main.f), Mooncake.NoFData}, Mooncake.CoDual{Vector{Float64}, Vector{Float64}}}, Mooncake.CoDual{Float64, Mooncake.NoFData}, Tuple{Float64}, Tuple{Mooncake.NoRData, Mooncake.NoRData}, false, Val{2}}, Nothing, Tuple{Mooncake.NoTangent, Vector{Float64}}}}(Val{Nothing}(), Mooncake.Cache{Mooncake.DerivedRule{Tuple{typeof(Main.f), Vector{Float64}}, Tuple{Mooncake.CoDual{typeof(Main.f), Mooncake.NoFData}, Mooncake.CoDual{Vector{Float64}, Vector{Float64}}}, Mooncake.CoDual{Float64, Mooncake.NoFData}, Tuple{Float64}, Tuple{Mooncake.NoRData, Mooncake.NoRData}, false, Val{2}}, Nothing, Tuple{Mooncake.NoTangent, Vector{Float64}}}(Mooncake.DerivedRule{Tuple{typeof(Main.f), Vector{Float64}}, Tuple{Mooncake.CoDual{typeof(Main.f), Mooncake.NoFData}, Mooncake.CoDual{Vector{Float64}, Vector{Float64}}}, Mooncake.CoDual{Float64, Mooncake.NoFData}, Tuple{Float64}, Tuple{Mooncake.NoRData, Mooncake.NoRData}, false, Val{2}}(MistyClosure (::Mooncake.CoDual{typeof(Main.f), Mooncake.NoFData}, ::Mooncake.CoDual{Vector{Float64}, Vector{Float64}})::Mooncake.CoDual{Float64, Mooncake.NoFData}->◌, Base.RefValue{MistyClosures.MistyClosure{Core.OpaqueClosure{Tuple{Float64}, Tuple{Mooncake.NoRData, Mooncake.NoRData}}}}(MistyClosure (::Float64)::Tuple{Mooncake.NoRData, Mooncake.NoRData}->◌), Val{2}()), nothing, (Mooncake.NoTangent(), [0.0, 0.0, 0.0])))
The typical input should have the same size and type as the actual inputs we will provide later on. As for the contents of the preparation result, they do not matter. What matters is that it captures everything you need for DI.gradient to be fast:
DI.gradient(f, prep, backend, x) # fast
3-element Vector{Float64}:
2.0
4.0
6.0
For optimal speed, you can provide storage space for the gradient and call DI.gradient! instead:
grad = similar(x)
DI.gradient!(f, grad, prep, backend, x) # very fast
3-element Vector{Float64}:
2.0
4.0
6.0
If you also need the value of the function, check out DI.value_and_gradient or DI.value_and_gradient!:
DI.value_and_gradient(f, prep, backend, x)
(14.0, [2.0, 4.0, 6.0])
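Here is a quick sketch of the in-place variant, which writes into the grad buffer allocated above and returns both the value and the gradient:
DI.value_and_gradient!(f, grad, prep, backend, x)  # returns (value, grad), overwriting grad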
Multiple arguments
What should you do if your function takes more than one input argument? Well, DI can still handle it, assuming that you only want the derivative with respect to one of them (the first one, by convention). For instance, consider the function
g(x, a, b) = a * f(x) + b
g (generic function with 1 method)
You can easily compute the gradient with respect to x, while keeping a and b fixed. To do that, just wrap these two arguments inside DI.Constant, like so:
typical_a, typical_b = 1.0, 1.0
prep = DI.prepare_gradient(g, backend, typical_x, DI.Constant(typical_a), DI.Constant(typical_b))
a, b = 42.0, 3.14
DI.value_and_gradient(g, prep, backend, x, DI.Constant(a), DI.Constant(b))
(591.14, [84.0, 168.0, 252.0])
Note that this works even when you change the value of a or b (those are not baked into the preparation result).
If one of your additional arguments behaves like a scratch space in memory (instead of a meaningful constant), you can use DI.Cache instead.
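Here is a minimal sketch of DI.Cache, assuming your versions of DI and Mooncake support cache contexts; g_cache is a hypothetical function that overwrites a scratch buffer c on every call:
g_cache(x, c) = (c .= abs2.(x); sum(c))  # c is scratch space, not a meaningful input
c = similar(typical_x)
prep_cache = DI.prepare_gradient(g_cache, backend, typical_x, DI.Cache(c))
DI.gradient(g_cache, prep_cache, backend, x, DI.Cache(c))  # gradient with respect to x only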
Now what if you care about the derivatives with respect to every argument? You can always go back to the single-argument case by putting everything inside a tuple:
g_tup(xab) = xab[2] * f(xab[1]) + xab[3]
prep = DI.prepare_gradient(g_tup, backend, (typical_x, typical_a, typical_b))
DI.value_and_gradient(g_tup, prep, backend, (x, a, b))
(591.14, ([84.0, 168.0, 252.0], 14.0, 1.0))
You can also use the native API of Mooncake.jl, discussed below.
Beyond gradients
Going through DI allows you to compute other kinds of derivatives, like (reverse-mode) Jacobian matrices. The syntax is very similar:
h(x) = cos.(x) .* sin.(reverse(x))
prep = DI.prepare_jacobian(h, backend, x)
DI.jacobian(h, prep, backend, x)
3×3 Matrix{Float64}:
-0.118748 0.0 -0.534895
0.0 -0.653644 0.0
-0.534895 0.0 -0.118748
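As with gradients, here is a sketch of the in-place variant, which fills a preallocated output matrix:
J = similar(x, length(h(x)), length(x))
DI.jacobian!(h, J, prep, backend, x)  # writes the Jacobian of h at x into J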
Mooncake.jl API
Mooncake.jl Functions
Mooncake.jl provides the following core differentiation functions:
- Forward mode: Mooncake.value_and_derivative!! - computes function value and the Frechet derivative
- Reverse mode: Mooncake.value_and_gradient!! - computes function value and gradient (when output is scalar)
- Reverse mode: Mooncake.value_and_pullback!! - computes function value and pullback (general case)
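As a rough sketch of the reverse-mode entry points (assuming your Mooncake version provides the Mooncake.prepare_gradient_cache helper), the gradient of f from the DI examples above can be computed natively like this:
cache = Mooncake.prepare_gradient_cache(f, x)
Mooncake.value_and_gradient!!(cache, f, x)  # value plus gradients w.r.t. every argument, including f itself
# The native API differentiates with respect to all arguments, which also covers the multi-argument g above:
cache_g = Mooncake.prepare_gradient_cache(g, x, a, b)
Mooncake.value_and_gradient!!(cache_g, g, x, a, b)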
Terminology Comparison with DifferentiationInterface.jl
Mooncake.jl discusses Frechet derivatives and their adjoints, as described in detail in Algorithmic Differentiation. This differs from the conventions used by DifferentiationInterface.jl and some other AD packages.
General cases:
- Frechet derivative: In forward mode, Mooncake computes the Frechet derivative D f[x], which maps tangent vectors to tangent vectors. This corresponds to what DifferentiationInterface refers to as a "pushforward", and is implemented in Mooncake.value_and_derivative!!.
- Adjoint of derivative and pullback: In reverse mode, Mooncake computes the adjoint D f[x]* of the Frechet derivative, which maps cotangent vectors backwards through the computation. This corresponds to what DifferentiationInterface calls a "pullback", and is implemented in Mooncake.value_and_pullback!!.
Special cases (scalar input/output):
- Derivative: When the input is scalar, the Frechet derivative f'(x) = D f[x](v) with v = 1 gives the ordinary derivative. This corresponds to DI.derivative, while Mooncake lacks an equivalent API and handles this as a special case of Mooncake.value_and_derivative!!.
- Gradient: When the output is scalar, the adjoint of the derivative applied to 1 gives the gradient ∇f. This corresponds to DI.gradient and is implemented in Mooncake.value_and_gradient!!.
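To make the gradient/pullback correspondence concrete, here is a hedged sketch (assuming Mooncake.prepare_pullback_cache is available in your version): seeding the pullback of the scalar-valued f with the cotangent 1.0 reproduces the gradient computed earlier.
cache = Mooncake.prepare_pullback_cache(f, x)
Mooncake.value_and_pullback!!(cache, 1.0, f, x)  # should match value_and_gradient!! for scalar outputs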
For a detailed mathematical treatment of these concepts, see Algorithmic Differentiation, particularly the sections on Derivatives.