Reverse-Mode Design

Last checked: 04/04/2026, Julia v1.10 / v1.11 / v1.12.

This page gives a high-level map of how Mooncake's reverse-mode transform is structured. It is aimed at readers who want to understand the main ideas before reading the implementation.

High-Level Transform: `IRCode` to `IRCode`

The reverse-mode pipeline is easiest to understand as a four-step transform:

Start from normalized primal IRCode.
Convert that IRCode into Mooncake's builder-local CFG representation.
Assemble new forward and reverse CFGs.
Lower each assembled CFG back to compiler IRCode.

So reverse mode is no longer "edit compiler IR in place until it works". The compiler IRCode is the input and output format, but most of the reverse-mode assembly happens in the middle on CFGBlocks with Mooncake IDs.

1. Normalize and reinterpret the primal `IRCode`

generate_ir begins by looking up the inferred primal IRCode, running normalise!, and converting the result into CFG blocks. At that point the source of truth is still the primal IRCode, but reverse mode stops working directly with compiler block numbers and SSA names.

The local CFG layer exists because reverse mode needs to introduce new control flow, not just rewrite statements. In particular, the pullback has to reconstruct which predecessor edge was taken through the primal CFG, including phi-sensitive cases.

2. Translate each primal statement into AD fragments

Each primal statement is translated by make_ad_stmts! into an ADStmtInfo. This is a small per-statement plan containing:

forward-pass instructions
reverse-pass instructions
an optional communication value that must survive from the forward run to the pullback

ADInfo stores the global transform state used while assembling those fragments: argument and SSA type tables, reverse-data references, shared-data bookkeeping, block-stack information for control-flow reconstruction, and debug-mode configuration.

The important design point is that statement translation does not directly build the final pullback IRCode. It produces forward and reverse fragments first, and whole CFGs are assembled afterward.

3. Assemble forward and reverse CFGs

forwards_pass_ir builds the forward closure CFG. This CFG:

loads shared captured state
initializes lazy zero-rdata bookkeeping
emits the translated forward statements for each primal block
pushes communication values needed later by the pullback
records block-stack information when the reverse pass must recover dynamic control flow

pullback_ir then builds the pullback CFG separately. This CFG:

loads shared data and reverse-data references
dispatches to the block that actually exited during the forward run
walks primal statements in reverse order
reconstructs predecessor-sensitive control flow
handles phi nodes by routing through edge-specific reverse blocks when needed
materializes the final cotangent tuple returned by the pullback closure

This is why the pullback is more than "run the statements backwards". It also has to replay the control-flow structure needed to send cotangents back along the correct incoming edges.

4. Lower the builder CFG back to compiler `IRCode`

Once the forward and reverse CFGs are assembled, lower_cfg_blocks_to_ir turns them back into coherent compiler IRCode. This step handles the mechanical reconstruction work in one place:

canonicalizing the local CFG
pruning unreachable blocks
lowering switch-style control flow into compiler-compatible terminators
rebuilding SSA numbering and block numbering
constructing a fresh Core.Compiler.IRCode

After that, generate_ir can run the usual optimization pass on both results and wrap them into opaque closures for the final derived reverse rule.

In short, the reverse-mode transform is:

primal IRCode
  -> normalized primal IRCode
  -> builder-local CFG
  -> forward CFG + pullback CFG
  -> forward IRCode + pullback IRCode

That middle CFG stage is what keeps the reverse-mode implementation manageable: it isolates the hard control-flow surgery from the compiler's concrete IRCode datastructure, and only lowers back to compiler IR once the new program structure is complete.

A Worked Mini-Example

Here is the smallest useful mental model for the whole reverse-mode pipeline. Consider a primal function with one active call followed by a return:

function f(x)
    y = sin(x)
    return y
end

Very roughly, the normalized primal IRCode looks like:

bb1:
    %1 = sin(_2)
    return %1

Step 1: statement translation

make_ad_stmts! translates the call and the return into forward and reverse fragments.

For the call, the important effect is:

forward:
    %rule_result = rule_for_sin(x_arg)
    %pb = getfield(%rule_result, 2)
    %1 = getfield(%rule_result, 1)

reverse:
    %d1 = rdata_ref_for_%1[]
    rdata_ref_for_%1[] = zero(...)
    %dx = %pb(%d1)
    increment_ref!(rdata_ref_for_x, getfield(%dx, 1))

For the return, the important effect is:

forward:
    return %1

reverse:
    increment_ref!(rdata_ref_for_%1, dy)

Step 2: forward CFG assembly

forwards_pass_ir wraps those fragments in a forward CFG with an extra entry block.

Conceptually:

fwd_entry:
    load shared captures
    initialize lazy zero-rdata state
    goto fwd_bb1

fwd_bb1:
    %rule_result = ...
    %pb = ...
    %1 = ...
    push!(comms_stack, tuple(%pb))
    return %1

The forward closure therefore does two things at once:

computes the primal/codual result
stores the information that the pullback will need later

Step 3: pullback CFG assembly

pullback_ir builds a separate pullback CFG:

rvs_entry:
    load shared captures
    create reverse-data refs
    goto rvs_bb1

rvs_bb1:
    %pb = getfield(pop!(comms_stack), 1)
    increment_ref!(rdata_ref_for_%1, dy)
    %d1 = rdata_ref_for_%1[]
    rdata_ref_for_%1[] = zero(...)
    %dx = %pb(%d1)
    increment_ref!(rdata_ref_for_x, getfield(%dx, 1))
    goto rvs_exit

rvs_exit:
    read argument rdata refs
    instantiate any lazy zero-rdata placeholders
    return argument cotangent tuple

This is the essential pattern repeated across larger examples: the pullback is a separate CFG that consumes stored forward-pass data and sends cotangents backward through the primal dependency structure.

Step 4: lower back to compiler IR

Finally, lower_cfg_blocks_to_ir converts both CFGs back to ordinary compiler IRCode.

So even in this tiny example the real flow is:

primal IRCode
  -> translated statement fragments
  -> forward CFG / pullback CFG
  -> forward IRCode / pullback IRCode

These examples use schematic names such as x_arg rather than exact lowered argument slots. In the real generated IR, argument indices can shift because the generated closures carry extra state in addition to the primal arguments.

`CFGBlock`: The Reverse-Mode Working IR

CFGBlock is Mooncake's reverse-mode-local basic-block representation. It is the format used while assembling the forward closure CFG and the pullback CFG inside reverse_mode.jl.

At a high level, a CFGBlock is just:

a stable internal block ID
a vector of (ID, NewInstruction) pairs for the statements in that block

Those IDs are Mooncake-local IDs, not compiler SSA numbers or compiler block numbers. That is deliberate: reverse mode needs a representation that can survive block insertion, block splitting, and control-flow rewrites without constantly renumbering compiler SSA values.

What `CFGBlock` is used for

The reverse transform uses CFGBlocks to do the hard structural work before lowering back to compiler IR. In particular, the builder-local CFG is where Mooncake:

translates primal SSA statements into forward and reverse fragments
inserts extra entry and exit blocks
groups and restores communication values for the pullback
rewrites control flow into predecessor-sensitive reverse dispatch
creates edge-specific reverse blocks for phi handling
prunes unreachable blocks and canonicalizes the resulting CFG

So CFGBlock is not an alternative public IR for Mooncake as a whole. It is a local assembly format used to construct reverse-mode programs safely before converting them back into Core.Compiler.IRCode.

Why `IRCode` is not sufficient as the working format

IRCode is sufficient as the source and target format. It is not a good format for the middle of reverse-mode assembly.

The core problem is that reverse mode does more than local statement replacement. It often has to:

create fresh blocks
insert control-flow that did not exist in the primal
split one logical reverse step across several blocks
route cotangents along predecessor-specific edges
keep phi handling consistent with those edges
rebuild SSA numbering and CFG numbering coherently at the end

Compiler IRCode stores statements and CFG structure in tightly related forms. If you edit it mid-assembly, you have to keep the instruction stream, block numbering, terminators, phi edges, and CFG metadata coherent at every intermediate step. That is possible for local edits, but it becomes brittle once reverse mode starts inserting whole blocks and rethreading control flow.

CFGBlock avoids that problem by giving reverse mode a looser construction format:

block identity is stable while the transform is running
instructions can be inserted without immediate SSA renumbering
predecessor and successor rewrites can happen before final lowering
phi-edge handling can be expressed directly in terms of predecessor block IDs

Only once the new forward and reverse CFGs are structurally complete does Mooncake lower them back to IRCode, rebuild block/SSA numbering, and hand the result back to the compiler.

The right way to think about it

IRCode is the compiler-facing representation. CFGBlock is the reverse-mode assembly representation.

Mooncake still begins with normalized primal IRCode and ends with compiler IRCode, but it does the difficult control-flow surgery in CFGBlock form because that is the point where the transform needs flexibility more than compiler-format exactness.

Data Structures That Matter Most

Two data structures carry most of the transform's state:

ADInfo: global state shared across the whole derivation
ADStmtInfo: the per-statement translation result

`ADInfo`

When reading the implementation, the most important ADInfo fields are:

shared_data_pairs: the values captured by both generated closures
block_stack_id and block_stack: the control-flow replay channel
arg_rdata_ref_ids and ssa_rdata_ref_ids: where reverse data is accumulated
ssa_insts and arg_types: the primal type information used during translation
lazy_zero_rdata_ref_id: the placeholder-zero mechanism used at pullback exit

The remaining fields mostly support those jobs rather than introducing separate ideas.

`ADStmtInfo`

ADStmtInfo is simpler. Its central fields are:

fwds: forward-pass instructions for the primal statement
rvs: reverse-pass instructions for that statement
comms_id: the optional value that must survive from forward execution to pullback execution

If you understand those three fields, you understand the role of ADStmtInfo.

Statement-Level MWEs

The reverse-mode transform is easiest to follow if you read make_ad_stmts! as a translator from one primal SSA statement into:

the forward statements that execute the primal computation and save anything the pullback will need later
the reverse statements that consume cotangents and propagate them to arguments and earlier SSA values

These are only sketches, but they match the current implementation strategy closely. They use pseudocode-style helper names such as rule_for_sin, increment_ref!, and switch_to_reverse_phi_edge(...) to show the dataflow. They are not literal emitted APIs or exact compiler IR.

MWE 1: Constant literal

Primal statement:

%5 = 3.0

Forward fragment:

%5 = uninit_fcodual(3.0)

Reverse fragment:

nothing

Constants produce a CoDual on the forward pass, but there is nothing to propagate on the reverse pass because Mooncake does not track cotangents for literals.

MWE 2: Active branch condition

Primal statement:

goto #3 if not %2

If %2 is active, then %2 is a CoDual in the forward IR, so the condition must branch on its primal:

Forward fragment:

%cond = primal(%2)
goto #3 if not %cond

Reverse fragment:

nothing

The branch itself carries no cotangent information. Its effect on the reverse pass is indirect: the forward pass records enough control-flow information for the pullback to recover which edge was taken.

MWE 3: Active call

Primal statement:

%7 = sin(%4)

Very roughly, the forward fragment becomes:

%rule_result = rule_for_sin(%4)
%pb = getfield(%rule_result, 2)
%raw_y = getfield(%rule_result, 1)
%7 = typeassert(%raw_y, CoDual{Float64, ...})

and the reverse fragment becomes:

%dy = rdata_ref_for_%7[]
rdata_ref_for_%7[] = zero_like_rdata_from_type(typeof(%7))
%dx_tuple = %pb(%dy)
increment_ref!(rdata_ref_for_%4, getfield(%dx_tuple, 1))

The exact generated code is more explicit than this because Mooncake writes out the getfield, setfield!, and increment!! operations directly. That keeps the reverse-data references visible to the optimizer.

The important point is that the forward fragment computes both the output codual and a pullback object, and the reverse fragment later consumes the stored pullback plus the output cotangent.

MWE 4: Return of an active SSA value

Primal statement:

return %9

Forward fragment:

%checked = typeassert(%9, fwd_ret_type)
return %checked

Reverse fragment:

increment_ref!(rdata_ref_for_%9, dy)

The forward closure returns the codual value. The pullback closure receives the cotangent of that return value and accumulates it into the reverse-data reference associated with %9.

MWE 5: Phi nodes

Consider a primal block that starts with a phi node:

3 ┄ %6 = φ (#1 => arg_x, #2 => %5)

On the forward pass, this is mostly structural. The phi node is rebuilt so that its incoming values are coduals rather than raw primal values:

3 ┄ %6 = φ (#1 => codual_arg_x, #2 => %5)

Schematically, the incoming primal argument is replaced by the corresponding codual argument in the generated forward closure. In the real lowered IR, the exact argument slot can shift because the generated closure carries extra leading state.

The important work happens on the reverse side, not at the phi statement itself. Suppose the cotangent for %6 is stored in r%6. When the pullback reaches the reverse counterpart of this block, it first extracts and zeros the cotangent for %6, then dispatches to an edge-specific reverse block:

%d6 = r%6[]
r%6[] = zero_like_rdata_from_type(typeof(%6))
switch_to_reverse_phi_edge(...)

If the primal arrived from block #1, the reverse phi-edge block behaves roughly like:

increment_ref!(r_arg_x, %d6)
goto reverse_block_for_#1

If the primal arrived from block #2, it behaves roughly like:

increment_ref!(r%5, %d6)
goto reverse_block_for_#2

So the key point is that phi handling is split across two places:

the forward pass rebuilds the phi with codual inputs
the reverse pass routes cotangents back through predecessor-specific phi-edge blocks

This is why the pullback needs predecessor reconstruction, rather than just local statement reversal.

How Control Flow Is Handled

Reverse mode cannot treat control flow as a local statement rewrite problem. The forward pass must preserve enough information for the pullback to know:

which block actually executed next
which predecessor edge reached a join block
which reverse block should run after the current one finishes

That is why reverse mode assembles whole CFGs and uses the block stack when the predecessor is not statically determined.

MWE 1: Straight-line fallthrough

Suppose the primal CFG is just:

bb1 -> bb2 -> return

and bb2 has only one predecessor, namely bb1.

In this case there is no ambiguity. The forward pass does not need to log any predecessor ID for bb2, and the reverse pass can jump directly from the reverse counterpart of bb2 to the reverse counterpart of bb1.

Very roughly:

forward:   fwd_bb1 ; fwd_bb2 ; return
reverse:   rvs_bb2 ; goto rvs_bb1

This is the cheap path: no block-stack push, no block-stack pop, no predecessor switch.

MWE 2: Simple conditional branch

Suppose the primal CFG is:

bb1:
    if cond goto bb2 else bb3

bb2:
    ...
    goto bb4

bb3:
    ...
    goto bb4

bb4:
    return y

On the forward pass, execution may reach bb4 from either bb2 or bb3. That predecessor is not statically known, so the forward pass records enough control-flow information to recover it later.

Conceptually the forward pass does something like:

fwd_bb1:
    branch on primal(cond)

fwd_bb2:
    push!(block_stack, bb2)
    ...
    goto fwd_bb4

fwd_bb3:
    push!(block_stack, bb3)
    ...
    goto fwd_bb4

Then, once the pullback reaches the reverse counterpart of bb4, it uses make_switch_stmts to recover which predecessor actually led there:

rvs_bb4:
    prev = pop!(block_stack)
    if prev == bb2
        goto rvs_phi_or_pred_for_bb2
    else
        goto rvs_phi_or_pred_for_bb3

So the pullback does not "re-run the branch condition". It replays the realized control-flow path from the forward run.

MWE 3: Join block with phi node

Now combine branching with a phi node:

bb2:
    %5 = ...
    goto bb4

bb3:
    goto bb4

bb4:
    %6 = φ (#2 => %5, #3 => arg_x)
    return %6

The forward pass computes %6 in the usual SSA sense. The reverse pass has to do two things at the join:

determine whether control came from bb2 or bb3
send the cotangent of %6 back to %5 or arg_x accordingly

So the reverse CFG for bb4 behaves roughly like:

rvs_bb4:
    d6 = r%6[]
    r%6[] = zero(...)
    prev = pop!(block_stack)
    if prev == bb2
        goto rvs_phi_edge_bb2
    else
        goto rvs_phi_edge_bb3

rvs_phi_edge_bb2:
    increment_ref!(r%5, d6)
    goto rvs_bb2

rvs_phi_edge_bb3:
    increment_ref!(r_arg_x, d6)
    goto rvs_bb3

This is the precise place where control-flow replay and phi handling meet.

MWE 4: Unique-predecessor optimization

Mooncake does not always push block IDs. If a block's predecessor is uniquely determined, the reverse pass can hard-code that predecessor instead of reading the block stack.

For example:

bb1 -> bb2
bb2 -> bb3
bb3 -> return

If bb3 can only ever be reached from bb2, then the reverse pass for bb3 can go straight to bb2. No stack traffic is needed for that edge.

This optimization matters because dynamic control-flow logging is only needed where the forward execution path loses information that the pullback must recover later.

Entry and Exit Blocks

Both generated closures contain extra structural blocks that do not correspond directly to a single primal block.

Forward entry block

The forward entry block is responsible for:

loading shared captured data
initializing lazy zero-rdata placeholders for arguments
optionally logging the synthetic entry block for later reverse dispatch

Pullback entry block

The pullback entry block is responsible for:

loading the same shared captured data
creating reverse-data references for arguments and SSA values
dispatching to the reverse block associated with the primal block that actually returned

Pullback exit block

The pullback exit block is responsible for:

reading argument reverse-data references
materializing true zeros from lazy zero-rdata placeholders where needed
packaging the final argument cotangent tuple
returning that tuple with the expected type

These blocks are worth calling out explicitly because they explain why the generated forward and reverse CFGs do not look like simple block-for-block reversals of the primal CFG.

Forward-to-Reverse Communication

Forward and reverse code do not communicate through a single channel. The current design uses three distinct mechanisms, each for a different kind of information.

1. Shared captured data

Some values are known statically when the derived rule is built and are needed by both the forward closure and the pullback closure. These go through SharedDataPairs.

Examples include:

rule objects that are not singleton values
constant coduals that are not safe to interpolate directly into IR
the per-transform block stack object
the lazy-zero-rdata reference used to materialize correctly typed zeros at pullback exit

Both generated closures receive the same captures tuple. At the start of each closure, shared_data_stmts extracts those captures back into local IDs.

This is the "static shared state" channel: build-time data, not per-execution data.

2. Per-block communication stacks

Some values are only known during the forward execution and must be handed to the pullback for the matching block. This is what the comms_id field of ADStmtInfo is for.

For a statement like an active call, the typical communicated value is the pullback object:

%rule_result = rule(...)
%pb = getfield(%rule_result, 2)

%pb is marked as the statement's comms_id. Later, create_comms_insts! groups the comms_ids for a primal block, builds a tuple of them on the forward pass, and pushes that tuple onto a block-local stack:

%tuple = tuple(%pb1, %pb2, ...)
push!(comms_stack, %tuple)

At the start of the corresponding reverse block, the pullback pops that tuple and restores the saved values to the same IDs:

%tuple = pop!(comms_stack)
%pb1 = getfield(%tuple, 1)
%pb2 = getfield(%tuple, 2)

This is the "dynamic value" channel: values produced during the forward execution and consumed later by the reverse execution.

3. The block stack for dynamic control flow

The pullback has to know which predecessor edge the primal actually took. When that predecessor is not statically determined, the forward pass records block IDs on a dedicated BlockStack.

On the forward pass, selected blocks push their ID:

push!(block_stack, current_block_id)

On the reverse pass, make_switch_stmts pops the predecessor ID and dispatches accordingly:

%prev = pop!(block_stack)
switch(%prev == pred1 ? ..., %prev == pred2 ? ..., ...)

This is what allows the pullback to send cotangents along the correct incoming edge and to handle phi nodes via predecessor-specific reverse blocks.

In short, the three communication mechanisms are:

captures shared by both closures for static build-time data
per-block comms stacks for dynamic forward values such as pullback objects
the block stack for replaying dynamic control flow on the reverse pass

Concept-to-Helper Map

If you want to jump from the conceptual description in this page to the implementation, these are the main landmarks:

statement translation: make_ad_stmts!
shared captured data: SharedDataPairs, shared_data_tuple, shared_data_stmts
per-block forward-to-reverse values: create_comms_insts!
forward CFG assembly: forwards_pass_ir
pullback CFG assembly: pullback_ir
control-flow replay: __push_blk_stack!, __pop_blk_stack!, make_switch_stmts
phi-edge reverse routing: conclude_rvs_block, rvs_phi_block
local CFG representation: CFGBlock
lowering back to compiler IR: lower_cfg_blocks_to_ir

Where This Lives in Code

If you want to connect the conceptual story above to the implementation, the main entry points are:

build_rrule(interp::MooncakeInterpreter{C}, sig_or_mi; debug_mode=false)

generate_ir(
    interp::MooncakeInterpreter, sig_or_mi; debug_mode=false, do_inline=true
)

Here sig_or_mi is either a signature such as Tuple{typeof(foo), Float64} or a Core.MethodInstance.

If the signature has a custom rule (Mooncake.is_primitive returns true), Mooncake uses that rule. Otherwise it looks up the primal IR and differentiates it.

lookup_ir calls Core.Compiler.typeinf_ircode on a method instance, which is a lower-level version of Base.code_ircode.

The transform works on Core.Compiler.IRCode, not the CodeInfo shown by @code_typed. normalise! rewrites some IRCode expressions into forms that are easier for the AD transform to handle, after which reverse mode assembles through the local CFG builder in reverse_mode.jl and lowers back to IRCode.

Captures and Closure Construction

The generated forward and reverse IRCodes do not communicate by calling each other directly. They communicate partly through ordinary runtime values and partly through a shared captures tuple that is embedded into the generated closures.

What gets captured

The shared captures tuple contains the values recorded in SharedDataPairs, such as:

static rule objects that are not safe or convenient to inline directly
stacks used for per-block communication
the block stack used for control-flow replay
the lazy-zero-rdata reference used to finish the pullback result

generate_ir builds this tuple with:

shared_data = shared_data_tuple(info.shared_data_pairs)

and both generated closures are later constructed with that same tuple:

fwd_oc = misty_closure(dri.fwd_ret_type, dri.fwd_ir, dri.shared_data...)
rvs_oc = misty_closure(dri.rvs_ret_type, dri.rvs_ir, dri.shared_data...)

Both generated closures therefore receive access to the same logical captured data. At the start of each closure, shared_data_stmts lowers that tuple back into local IDs.

So the resulting IRCode does not contain the captured values directly as ordinary SSA definitions. Instead, it contains loads from the closure's captures field near the entry block.

MWE: one captured tuple shared by both closures

Suppose SharedDataPairs conceptually contains:

[(id_rule, some_rule), (id_blk, block_stack), (id_zero, lazy_zero_ref)]

Then:

shared_data_tuple(...) == (some_rule, block_stack, lazy_zero_ref)

and the entry blocks in the lowered IRCodes begin by reconstructing those bindings:

fwd_entry:
    id_rule = getfield(_1, 1)
    id_blk  = getfield(_1, 2)
    id_zero = getfield(_1, 3)
    ...

rvs_entry:
    id_rule = getfield(_1, 1)
    id_blk  = getfield(_1, 2)
    id_zero = getfield(_1, 3)
    ...

Here _1 is the generated closure object. The getfield operations shown above are the loads that recover values from that closure's captured state. The important point is that the two generated closures do not each get separate logical bindings. They are built against the same captured state assembled during derivation.

After lowering, the final forward IRCode still contains the machinery that writes dynamic data needed by the pullback:

pushes onto comms stacks for values such as pullback objects
pushes onto the block stack when control-flow replay is needed
initialization of lazy zero-rdata placeholders

The final pullback IRCode contains the matching reads:

pops from comms stacks
pops from the block stack
loads from reverse-data references
loads from the lazy-zero-rdata capture when finishing the returned cotangent tuple

So although the builder-local CFG disappears after lowering, the final IRCode still encodes the forward-to-reverse contract explicitly through stack operations, ref operations, and capture loads.

If an active call produces a pullback object %pb, the final forward IRCode contains code equivalent to:

%tuple = tuple(%pb)
push!(comms_stack, %tuple)

and the final pullback IRCode contains the matching restore:

%tuple = pop!(comms_stack)
%pb = getfield(%tuple, 1)

That pair of writes and reads is how a value produced only during the forward run becomes available later in the pullback.

How the closures are built

generate_ir produces two compiler IRCodes:

one for the forward closure
one for the pullback closure

build_derived_rrule then packages those into closure objects. At a high level:

generate_ir returns DerivedRuleInfo, containing fwd_ir, rvs_ir, and shared_data
build_derived_rrule turns each IRCode plus shared_data into a MistyClosure
those closures are placed into a DerivedRule
when the derived rule is called, it returns the forward result plus a Pullback wrapper around the reverse closure

The end result is that users do not interact with raw IRCode directly. They call a derived rule, which runs the generated forward closure and receives a callable pullback object whose captures point at the same shared state prepared during derivation.

MWE: final wrapper structure

Conceptually, the final derived rule looks like:

DerivedRule(
    fwds_oc = MistyClosure(fwd_ir, shared_data),
    pb_oc_ref = Ref(MistyClosure(rvs_ir, shared_data)),
    ...
)

Calling that derived rule:

runs fwds_oc
returns the forward CoDual
returns a Pullback object that knows how to call pb_oc_ref[]

So the forward closure and pullback closure are separate pieces of generated code, but they are stitched together by the shared captures tuple and the DerivedRule / Pullback wrappers.

many operations are only as good as the primitive or derived rules available for them
debug mode wraps rule calls and changes the generated code shape slightly for diagnostics
the transform depends on compiler IR conventions that can shift across Julia minor versions
reverse mode assumes ordinary SSA/control-flow lowering patterns, not the full exception-state machinery generated for every possible language feature