Autograd
Automatic differentiation in RUMUS. The autograd engine records operations on an append-only tape and replays them in reverse to compute gradients — no manual calculus required.
How It Works
RUMUS implements reverse-mode automatic differentiation using a Wengert list (also called a tape). Every differentiable operation appends a BackwardOp entry to the tape during the forward pass. The backward pass then walks the tape in reverse topological order, accumulating gradients into a GradientStore.
"token-comment">// RUMUS uses a Wengert list(append-only tape) to record
"token-comment">// every differentiable operation as it executes. During the
"token-comment">// backward pass, the tape is replayed in reverse topological
"token-comment">// order to accumulate gradients.
"token-comment">//
"token-comment">// forward: x → y = relu(x) → loss = mse(y, target)
"token-comment">// tape: [ Relu { input_id, output_id },
"token-comment">// MseLoss { pred_id, target_id, output_id } ]
"token-comment">// backward: walk tape in reverse topo order, accumulate gradsRecording Operations
Tape recording is automatic. Any operation on a tracked tensor (one with AutogradState::Tracked) appends its corresponding BackwardOp variant to the global tape. You never interact with the tape directly.
Tensor;">
```rust
use rumus::tensor::Tensor;
use rumus::nn;

// Operations are recorded automatically when tensors
// have AutogradState::Tracked set.
let x = Tensor::new(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2]);
let w = Tensor::new(vec![0.1, 0.2, 0.3, 0.4], vec![2, 2]);

// Each of these records a BackwardOp on the tape:
let h = x.matmul(&w); // records BackwardOp::Matmul
let a = nn::relu(&h); // records BackwardOp::Relu
let loss = a.mul(&a); // records BackwardOp::Mul
```
Computing Gradients
Call autograd::backward(&loss) to trigger the backward pass. This returns a GradientStore — essentially a HashMap<GradId, Tensor> — containing the gradient for every tracked tensor that contributed to the loss.
```rust
use rumus::autograd;

// Compute all gradients in one call.
// Returns a GradientStore: HashMap<GradId, Tensor>
let mut grads = autograd::backward(&loss)?;

// Access individual gradients by parameter grad_id.
// remove() takes ownership — each gradient is consumed
// exactly once (useful for the optimizer step).
let grad_w = grads.remove(w.grad_id().unwrap());
let grad_x = grads.remove(x.grad_id().unwrap());
```
Inference Mode
The no_grad() function returns an RAII guard that disables tape recording for its entire lifetime. This is essential during inference (no need to compute gradients) and during optimizer parameter updates (you do not want the update step itself to be differentiated).
RAII pattern: The guard is stack-allocated. When it goes out of scope, tape recording automatically resumes. No manual cleanup needed — Rust enforces this at compile time.
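To make the RAII mechanics concrete, here is a minimal sketch of how such a guard can be built from a thread-local flag and a `Drop` impl. This is an illustration only, not the RUMUS implementation; the names `RECORDING`, `NoGradGuard`, and `is_recording` are hypothetical.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical per-thread recording flag.
    static RECORDING: Cell<bool> = Cell::new(true);
}

/// Guard that disables recording for its lifetime.
pub struct NoGradGuard {
    prev: bool,
}

/// Turn recording off and return a guard that restores
/// the previous state on drop.
pub fn no_grad() -> NoGradGuard {
    let prev = RECORDING.with(|r| r.replace(false));
    NoGradGuard { prev }
}

impl Drop for NoGradGuard {
    fn drop(&mut self) {
        // Restore the prior state when the guard leaves scope.
        RECORDING.with(|r| r.set(self.prev));
    }
}

pub fn is_recording() -> bool {
    RECORDING.with(|r| r.get())
}

fn main() {
    assert!(is_recording());
    {
        let _guard = no_grad();
        assert!(!is_recording()); // recording suspended
    }
    assert!(is_recording()); // guard dropped, recording resumed
}
```

Saving and restoring the previous flag (rather than unconditionally setting it to `true` on drop) is what makes nested `no_grad()` scopes behave correctly.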
```rust
use rumus::autograd::no_grad;

// no_grad() returns an RAII guard. While the guard is
// alive, ALL tensor operations bypass tape recording
// entirely — no BackwardOps are created, no grad_ids
// are assigned. This is critical for inference.
{
    let _guard = no_grad();
    // These operations are NOT recorded on the tape.
    let pred = model.forward(&input);
    println!("prediction: {:?}", pred.data());
}
// Guard dropped — tape recording resumes.

// This is also useful during the optimizer step:
// you don't want parameter updates to be tracked.
{
    let _guard = no_grad();
    // param -= lr * grad (not recorded)
}
```
BackwardOp Variants
Unlike frameworks that store closures (boxed trait objects) on each graph node, RUMUS uses a concrete enum BackwardOp with 16 variants. Each variant carries exactly the data needed to compute its local Jacobian — no more, no less. This makes the tape inspectable, serializable, and allocation-free.
"token-comment">// RUMUS uses a concrete enum instead of boxed closures
"token-comment">// for backward operations. This is a deliberate design
"token-comment">// choice with several advantages:
"token-comment">//
"token-comment">// 1. No heap allocation per operation
"token-comment">// 2. Pattern-matchable — easy to inspect the tape
"token-comment">// 3. Serializable — you can save/load computation graphs
"token-comment">// 4. No lifetime issues with captured references
"token-comment">//
"token-comment">// The 16 BackwardOp variants:
"token-comment">//
"token-comment">// Arithmetic: Add, Sub, Mul, Matmul
"token-comment">// Activations: Relu
"token-comment">// Losses: MseLoss, CrossEntropyLoss
"token-comment">// Conv/Pool: Im2Col, type">MaxPool2d
"token-comment">// ... and more
"token-comment">//
"token-comment">// Compare this to PyTorch, which stores Python closures
"token-comment">// on each graph node — powerful but opaque and hard to
"token-comment">// serialize.
enum BackwardOp {
Add { lhs_id: GradId, rhs_id: GradId },
Matmul { lhs_id: GradId, rhs_id: GradId,
lhs_shape: type">Vec<type">usize>, rhs_shape: type">Vec<type">usize> },
Relu { input_id: GradId, output_snapshot: VersionSnapshot },
MseLoss { pred_id: GradId, target_id: GradId },
"token-comment">// ... 12 more variants
}Architecture Details
The backward pass uses Kahn's algorithm to topologically sort the computation graph before processing. This ensures that when a node's gradient is computed, all downstream contributions have already been accumulated.
Why topological sort matters: Consider the expression y = x + x. The tensor x appears twice in the Add node. Without proper ordering and edge counting, you would compute dy/dx = 1 instead of the correct dy/dx = 2.
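The accumulation rule can be demonstrated with a toy tape. This is a deliberately simplified sketch (the `Add` struct and plain `usize` ids here are stand-ins, not RUMUS's actual types): the key point is that the backward walk uses `+=` into the store, so a tensor appearing twice receives both contributions.

```rust
use std::collections::HashMap;

type GradId = usize;

// Toy model of BackwardOp::Add: out = lhs + rhs.
struct Add {
    lhs: GradId,
    rhs: GradId,
    out: GradId,
}

/// Walk a tape of Add ops in reverse, accumulating gradients.
fn backward(tape: &[Add], loss: GradId) -> HashMap<GradId, f64> {
    let mut grads = HashMap::new();
    grads.insert(loss, 1.0); // d loss / d loss = 1
    for op in tape.iter().rev() {
        let g = *grads.get(&op.out).unwrap_or(&0.0);
        // `+=`, not `=`: when lhs == rhs (as in x + x),
        // x receives both contributions.
        *grads.entry(op.lhs).or_insert(0.0) += g;
        *grads.entry(op.rhs).or_insert(0.0) += g;
    }
    grads
}

fn main() {
    // y = x + x  →  dy/dx = 2
    let (x, y) = (0, 1);
    let tape = vec![Add { lhs: x, rhs: x, out: y }];
    let grads = backward(&tape, y);
    assert_eq!(grads[&x], 2.0);
}
```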
VersionSnapshot: Some backward ops (like Relu) need forward-pass values to compute gradients. RUMUS stores Weak references via VersionSnapshot, so dropped tensors do not leak memory.
"token-comment">// --- Backward Pass: Kahn's Algorithm ---
"token-comment">//
"token-comment">// The tape is a flat list, but the computation graph is
"token-comment">// a DAG. RUMUS uses Kahn's algorithm(BFS topological
"token-comment">// sort) to determine the correct backward order:
"token-comment">//
"token-comment">// 1. Count in-edges for each node(how many ops
"token-comment">// consume this tensor as input).
"token-comment">// 2. Start from the loss node(zero out-edges).
"token-comment">// 3. Process nodes whose in-edge count reaches zero.
"token-comment">// 4. Accumulate gradients via type">GradientStore.
"token-comment">//
"token-comment">// This correctly handles diamond patterns like x + x,
"token-comment">// where a single tensor appears in multiple operations.
"token-comment">// --- Strict Edge Counting ---
"token-comment">//
"token-comment">// An AtomicUsize `total_grads` counter ensures that
"token-comment">// expressions like `x.add(&x)` count both uses of x.
"token-comment">// Without this, the gradient for x would be halved.
"token-comment">// --- VersionSnapshot with Weak Refs ---
"token-comment">//
"token-comment">// type">Some backward ops need the forward-pass tensor values
"token-comment">// (e.g., Relu needs to know which elements were zeroed).
"token-comment">// VersionSnapshot stores a Weak<...> reference so that
"token-comment">// if the original tensor is dropped, the snapshot doesn't
"token-comment">// keep dead memory alive — it simply fails gracefully.