Interface
This is the public interface that day-to-day users of AD are expected to interact with if, for some reason, DifferentiationInterface.jl does not suffice. If you have not tried using Mooncake.jl via DifferentiationInterface.jl, please do so. See the Tutorial for more info.
Example
Here's a simple example demonstrating how to use Mooncake.jl's native API:
import Mooncake as MC
struct SimplePair
x1::Float64
x2::Float64
end
# Define a simple function
g(x::SimplePair) = x.x1^2 + x.x2^2
# Where to evaluate the derivative
x_eval = SimplePair(1.0, 2.0)

With friendly_tangents = false (the default), gradients for custom structures use a representation based on Mooncake.Tangent types. See Mooncake.jl's Rule System for more information.
cache = MC.prepare_gradient_cache(g, x_eval)
val, grad = MC.value_and_gradient!!(cache, g, x_eval)
# output
(5.0, (Mooncake.NoTangent(), Mooncake.Tangent{@NamedTuple{x1::Float64, x2::Float64}}((x1 = 2.0, x2 = 4.0))))

This produces a tuple containing the value of the function (here 5.0) and the gradient. The first element of the gradient is the gradient w.r.t. g itself, here NoTangent() since g has no differentiable fields. The second element is the gradient w.r.t. x; for the type SimplePair, it is represented as a @NamedTuple{x1::Float64, x2::Float64} wrapped in a Tangent object. The gradient w.r.t. x1 can, for example, be retrieved with grad[2].fields.x1.
With friendly_tangents=true, gradients are returned in a more readable form:
cache = MC.prepare_gradient_cache(g, x_eval; config=MC.Config(friendly_tangents=true))
val, grad = MC.value_and_gradient!!(cache, g, x_eval)
# output
(5.0, (Mooncake.NoTangent(), (x1 = 2.0, x2 = 4.0)))

The gradient w.r.t. x is now the NamedTuple (x1 = 2.0, x2 = 4.0).
In addition, there is an optional tuple-valued keyword argument args_to_zero that specifies a true/false value for each argument (e.g. g, x_eval), allowing tangent zeroing to be skipped on a per-argument basis when the value is constant. Note that the first true/false entry specifies whether to zero the tangent of g; zeroing g's tangent is not always necessary, but is sometimes required for non-constant callable objects.
cache = MC.prepare_gradient_cache(g, x_eval; config=MC.Config(friendly_tangents=true))
val, grad = MC.value_and_gradient!!(
cache,
g,
x_eval;
args_to_zero = (false, true),
)
# output
(5.0, (Mooncake.NoTangent(), (x1 = 2.0, x2 = 4.0)))

Aside: Any performance impact from using friendly_tangents = true should be very minor. If it is noticeable, something is likely wrong; please open an issue.
API Reference
Mooncake.Config — Type
Config(; debug_mode::Bool=false, silence_debug_messages::Bool=false, friendly_tangents::Bool=false)

Configuration struct for use with ADTypes.AutoMooncake.
Keyword Arguments
debug_mode::Bool=false: whether or not to run additional type checks when differentiating a function. This has considerable runtime overhead, and should only be switched on if you are trying to debug something that has gone wrong in Mooncake.

silence_debug_messages::Bool=false: if false and debug_mode is true, Mooncake will display some warnings that debug mode is enabled, in order to help prevent accidentally leaving debug mode on. If you wish to disable these messages, set this to true.

friendly_tangents::Bool=false: if true, Mooncake will represent tangents using the primal type at the interface level: the tangent type of a primal type P will be P when using friendly tangents, and tangent_type(P) otherwise (e.g. the friendly tangent of a custom struct will be of the same type as the struct instead of Mooncake's Tangent type). The tangent is converted from/to the friendly representation at the interface level, so all Mooncake internal computations and rule implementations always use the tangent_type representation.
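To make these options concrete, here is a short sketch of passing a Config to cache preparation, following the same pattern as the friendly_tangents example above. The function h and input x0 are illustrative names, not part of Mooncake's API.

```julia
import Mooncake as MC

h(x) = sum(abs2, x)   # simple scalar-valued function
x0 = [1.0, 2.0]

# Enable debug-mode type checks (considerable runtime overhead; use only
# when troubleshooting), and silence the accompanying warnings.
cfg = MC.Config(debug_mode=true, silence_debug_messages=true)
cache = MC.prepare_gradient_cache(h, x0; config=cfg)
val, grad = MC.value_and_gradient!!(cache, h, x0)  # val == 5.0
```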
Mooncake.value_and_derivative!! — Function
value_and_derivative!!(cache::ForwardCache, f::Dual, x::Vararg{Dual,N})

Returns a Dual containing the result of applying forward-mode AD to compute the (Frechet) derivative of primal(f) at the primal values in x in the direction of the tangent values in f and x.
value_and_derivative!!(cache::ForwardCache, (f, df), (x, dx), ...)

Returns a tuple (y, dy) containing the result of applying forward-mode AD to compute the (Frechet) derivative of primal(f) at the primal values in x in the direction of the tangent values contained in df and dx.
Tuples are used as inputs and outputs instead of Dual numbers to accommodate the case where internal Mooncake tangent types do not coincide with tangents provided by the user (in which case we translate between "friendly tangents" and internal tangents using cache storage).
cache must be the output of prepare_derivative_cache, and (fields of) f and x must be of the same size and shape as those used to construct the cache. This is to ensure that the gradient can be written to the memory allocated when the cache was built.
cache owns any mutable state returned by this function, meaning that mutable components of values returned by it will be mutated if you run this function again with different arguments. Therefore, if you need to keep the values returned by this function around over multiple calls to this function with the same cache, you should take a copy (using copy or deepcopy) of them before calling again.
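A minimal sketch of the tuple-based calling convention above, for a scalar function. The specific seeding shown (NoTangent() as the tangent of the field-free function f, and 1.0 as the direction for the scalar input) is an assumption inferred from the signature, not taken verbatim from Mooncake's docs.

```julia
import Mooncake as MC

f(x) = sin(x)
x = 1.0

cache = MC.prepare_derivative_cache(f, x)
# Seed the direction: NoTangent() for f itself (it has no fields),
# 1.0 as the tangent of the scalar input x.
y, dy = MC.value_and_derivative!!(cache, (f, MC.NoTangent()), (x, 1.0))
# dy should equal cos(1.0)
```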
Mooncake.value_and_gradient!! — Method
value_and_gradient!!(cache::Cache, f, x...; args_to_zero=(true, ...))

Computes a 2-tuple. The first element is f(x...), and the second is a tuple containing the gradient of f w.r.t. each argument. The first element of this tuple is the gradient w.r.t. any differentiable fields of f, the second w.r.t. the first element of x, etc. If the cache was prepared with config.friendly_tangents=true, the returned gradients use the same types as those of f and x. Otherwise, they use the tangent types associated to f and x.
Assumes that f returns a Union{Float16, Float32, Float64}.
As with all functionality in Mooncake, if f modifies itself or x, value_and_gradient!! will return both to their original state as part of the process of computing the gradient.
cache must be the output of prepare_gradient_cache, and (fields of) f and x must be of the same size and shape as those used to construct the cache. This is to ensure that the gradient can be written to the memory allocated when the cache was built.
cache owns any mutable state returned by this function, meaning that mutable components of values returned by it will be mutated if you run this function again with different arguments. Therefore, if you need to keep the values returned by this function around over multiple calls to this function with the same cache, you should take a copy (using copy or deepcopy) of them before calling again.
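The cache-ownership caveat above can be illustrated with a short sketch (the function f and inputs here are illustrative):

```julia
import Mooncake as MC

f(x) = sum(abs2, x)
cache = MC.prepare_gradient_cache(f, [1.0, 2.0])

_, grads1 = MC.value_and_gradient!!(cache, f, [1.0, 2.0])
grad_x = copy(grads1[2])   # copy before reusing the cache

# Reusing the cache may mutate the previously returned gradient in place,
# but grad_x still holds [2.0, 4.0].
_, grads2 = MC.value_and_gradient!!(cache, f, [3.0, 4.0])
```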
The keyword argument args_to_zero is a tuple of boolean values specifying which cotangents should be reset to zero before differentiation. It contains one boolean for each element of (f, x...). It is used for performance optimizations if you can guarantee that the initial cotangent allocated in cache (created by zero_tangent) never needs to be zeroed out again.
Example Usage
f(x, y) = sum(x .* y)
x = [2.0, 2.0]
y = [1.0, 1.0]
cache = prepare_gradient_cache(f, x, y)
value_and_gradient!!(cache, f, x, y)
# output
(4.0, (NoTangent(), [1.0, 1.0], [2.0, 2.0]))

Mooncake.value_and_pullback!! — Method
value_and_pullback!!(cache::Cache, ȳ, f, x...; args_to_zero=(true, ...))

If f(x...) returns a scalar, you should use value_and_gradient!!, not this function.
Computes a 2-tuple. The first element is f(x...), and the second is a tuple containing the pullback of f applied to ȳ. The first element of this tuple is the component of the pullback associated to any fields of f, the second is w.r.t. the first element of x, etc. If the cache was prepared with config.friendly_tangents=true, the pullback uses the same types as those of f and x. Otherwise, it uses the tangent types associated to f and x.
There are no restrictions on what y = f(x...) is permitted to return. However, ȳ must be an acceptable tangent for y. If the cache was prepared with config.friendly_tangents=false, this means that, for example, it must be true that tangent_type(typeof(y)) == typeof(ȳ). If the cache was prepared with config.friendly_tangents=true, then typeof(y) == typeof(ȳ).
As with all functionality in Mooncake, if f modifies itself or x, value_and_pullback!! will return both to their original state as part of the process of computing the pullback.
cache must be the output of prepare_pullback_cache, and (fields of) f and x must be of the same size and shape as those used to construct the cache. This is to ensure that the gradient can be written to the memory allocated when the cache was built.
cache owns any mutable state returned by this function, meaning that mutable components of values returned by it will be mutated if you run this function again with different arguments. Therefore, if you need to keep the values returned by this function around over multiple calls to this function with the same cache, you should take a copy (using copy or deepcopy) of them before calling again.
The keyword argument args_to_zero is a tuple of boolean values specifying which cotangents should be reset to zero before differentiation. It contains one boolean for each element of (f, x...). It is used for performance optimizations if you can guarantee that the initial cotangent allocated in cache (created by zero_tangent) never needs to be zeroed out again.
Example Usage
f(x, y) = sum(x .* y)
x = [2.0, 2.0]
y = [1.0, 1.0]
cache = Mooncake.prepare_pullback_cache(f, x, y)
Mooncake.value_and_pullback!!(cache, 1.0, f, x, y)
# output
(4.0, (NoTangent(), [1.0, 1.0], [2.0, 2.0]))

Mooncake.prepare_derivative_cache — Function
prepare_derivative_cache(fx...; config=Mooncake.Config())

Returns a cache used with value_and_derivative!!. See that function for more info.
Mooncake.prepare_gradient_cache — Function
prepare_gradient_cache(f, x...; config=Mooncake.Config())

Returns a cache used with value_and_gradient!!. See that function for more info.
The API guarantees that tangents are initialized at zero before the first autodiff pass.
Mooncake.prepare_pullback_cache — Function
prepare_pullback_cache(f, x...; config=Mooncake.Config())

Returns a cache used with value_and_pullback!!. See that function for more info.
The API guarantees that tangents are initialized at zero before the first autodiff pass.
Mooncake.prepare_hvp_cache — Function
prepare_hvp_cache(f, x...; config=Mooncake.Config())

Prepare a cache for computing Hessian-vector products (HVPs) of f. Returns an HVPCache for use with value_and_hvp!!.
f must map x... to a scalar. Multiple arguments are supported: see value_and_hvp!! for the calling convention.
The cache compiles an outer forward-mode rule over an inner reverse-mode gradient. The inner rule is compiled only once regardless of how many HVPs are subsequently evaluated.
Note: cache is tied to the types and shapes of x.... Evaluating at a different point is fine, but changing the shapes requires a new cache.
f(x) = sum(x .* x)
x = [1.0, 2.0]
cache = Mooncake.prepare_hvp_cache(f, x)
f_val, gradient, hvp = Mooncake.value_and_hvp!!(cache, f, [1.0, 0.0], x)
f_val ≈ 5.0 && gradient ≈ [2.0, 4.0] && hvp ≈ [2.0, 0.0]
# output
true

Mooncake.value_and_hvp!! — Function
value_and_hvp!!(cache::HVPCache, f, v, x...)

Given a cache prepared by prepare_hvp_cache, compute the gradient of f at x... and the Hessian-vector product H v.
Single argument: v is the tangent direction; returns (f(x), ∇f(x), H(x)v). For f: Rⁿ → R with x::Vector{Float64}, the gradient and HVP are Vector{Float64}.
Multiple arguments: v must be a tuple of tangent directions (one per argument); returns (f(x...), (∇f_x1, ∇f_x2, ...), (h1, h2, ...)) where hk = ∑_j (∂²f/∂xk∂xj) v[j] is the joint Hessian-vector product for argument xk.
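A sketch of the multiple-argument convention just described, using an illustrative bilinear function (names and values are ours, not Mooncake's). For f(x, y) = sum(x .* y), the cross blocks ∂²f/∂x∂y are the identity, so h_x = v_y and h_y = v_x.

```julia
import Mooncake as MC

f(x, y) = sum(x .* y)
x = [1.0, 2.0]
y = [3.0, 4.0]

cache = MC.prepare_hvp_cache(f, x, y)
# One tangent direction per argument, passed as a tuple.
v = ([1.0, 0.0], [0.0, 1.0])
val, grads, hvps = MC.value_and_hvp!!(cache, f, v, x, y)
# val == 11.0; grads == ([3.0, 4.0], [1.0, 2.0])
# For this bilinear f: hvps == ([0.0, 1.0], [1.0, 0.0])
```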
cache must be the output of prepare_hvp_cache, and f must be the same function object used to construct cache. All x arguments must have the same sizes and element types as used to construct the cache.
cache owns the mutable state in the returned values. Take a copy before calling again if you need to retain previous results.
HVPCache is not safe for concurrent reuse across threads. Use a separate cache per task/thread if calls may overlap in time.
f(x) = sum(x .* x)
x = [1.0, 2.0]
cache = Mooncake.prepare_hvp_cache(f, x)
f_val, gradient, hvp = Mooncake.value_and_hvp!!(cache, f, [1.0, 0.0], x)
f_val ≈ 5.0 && gradient ≈ [2.0, 4.0] && hvp ≈ [2.0, 0.0]
# output
true

Mooncake.prepare_hessian_cache — Function
prepare_hessian_cache(f, x...; config=Mooncake.Config())

Return a cache for computing f(x...), gradients ∇f, and the Hessian (or Hessian blocks) of f via value_gradient_and_hessian!!. Returns an HVPCache, which can also be used directly with value_and_hvp!!.
prepare_hessian_cache reuses the generic HVP cache builder. It eagerly checks only that at least one x argument was provided; validation that the x... inputs are AbstractVectors of IEEE floats, all with the same element type, is deferred to value_gradient_and_hessian!!.
Hessian computation uses forward-over-reverse AD: one forward-mode pass per input dimension over the reverse-mode gradient function.
f(x) = sum(x .^ 2)
x = [1.0, 2.0, 3.0]
cache = Mooncake.prepare_hessian_cache(f, x)
Mooncake.value_gradient_and_hessian!!(cache, f, x)
# output
(14.0, [2.0, 4.0, 6.0], [2.0 0.0 0.0; 0.0 2.0 0.0; 0.0 0.0 2.0])

Mooncake.value_gradient_and_hessian!! — Function
value_gradient_and_hessian!!(cache::HVPCache, f, x...)

Using a pre-built cache (from prepare_hessian_cache or prepare_hvp_cache), compute and return (value, gradient, hessian) of f.
Single argument: returns (f(x), ∇f(x), ∇²f(x)) — value, gradient vector, Hessian matrix.
Multiple arguments: returns (f(x1,...), (∇_x1 f, ∇_x2 f, ...), H_blocks) where H_blocks[k][j] is the nk × nj matrix ∂²f/∂xk∂xj. The return structure differs from the single-argument case.
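A sketch of the block structure just described, again with an illustrative bilinear function (names and values are ours). For f(x, y) = sum(x .* y), the diagonal blocks are zero and the off-diagonal blocks ∂²f/∂x∂y are the identity.

```julia
import Mooncake as MC

f(x, y) = sum(x .* y)
x = [1.0, 2.0]
y = [3.0, 4.0]

cache = MC.prepare_hessian_cache(f, x, y)
val, grads, H_blocks = MC.value_gradient_and_hessian!!(cache, f, x, y)
# H_blocks[1][2] is the 2×2 block ∂²f/∂x∂y, here the identity matrix;
# H_blocks[1][1] and H_blocks[2][2] are zero for this bilinear f.
```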
Uses forward-over-reverse AD: one forward-mode pass per total input dimension.
cache must be the output of prepare_hessian_cache or prepare_hvp_cache, and f must be the same function object used to construct cache. All x arguments must have the same sizes and element types as used to construct the cache. The current implementation supports only AbstractVectors of IEEE floats, with all arguments sharing the same element type. This restriction comes from the Hessian assembly logic, which sweeps a standard basis of tangent vectors and materialises dense matrix / block-matrix outputs. For non-vector inputs, use value_and_hvp!! to obtain second-order directional derivatives without forming a full Hessian.
HVPCache is not safe for concurrent reuse across threads. Use a separate cache per task/thread if calls may overlap in time.
Example
f(x) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
x = [1.2, 1.2]
cache = Mooncake.prepare_hessian_cache(f, x)
_, _, H = Mooncake.value_gradient_and_hessian!!(cache, f, x)
H
# output
2×2 Matrix{Float64}:
1250.0 -480.0
-480.0 200.0