Optimizers
First-class optimizer implementations with borrow-safe gradient consumption, GPU-fused kernels, LR scheduling, gradient clipping, and a high-level Trainer API.
Optimizer Trait
All optimizers implement a single trait. The step method takes a mutable reference to the gradient store, draining only the gradients for its registered parameters. This design eliminates overlapping borrows at compile time.
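The draining behaviour can be illustrated without RUMUS itself. The sketch below is a minimal stand-in that uses only the standard library; `ToyGradientStore`, `ToySgd`, the `take` method, and the `u32` parameter id are all hypothetical names chosen for illustration and are not the library's actual types.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins; the real RUMUS types are not shown in this document.
type ParamId = u32;

struct ToyGradientStore {
    grads: HashMap<ParamId, Vec<f32>>,
}

impl ToyGradientStore {
    /// Remove and return the gradient for `id`, if one is present.
    fn take(&mut self, id: ParamId) -> Option<Vec<f32>> {
        self.grads.remove(&id)
    }
}

struct ToySgd {
    lr: f32,
    // The parameters this optimizer registered, keyed by id.
    weights: HashMap<ParamId, Vec<f32>>,
}

impl ToySgd {
    /// Drain only the gradients that belong to registered parameters.
    fn step(&mut self, grads: &mut ToyGradientStore) {
        for (id, w) in self.weights.iter_mut() {
            if let Some(g) = grads.take(*id) {
                for (wi, gi) in w.iter_mut().zip(&g) {
                    *wi -= self.lr * gi;
                }
            }
        }
    }
}

fn main() {
    let mut store = ToyGradientStore { grads: HashMap::new() };
    store.grads.insert(1, vec![1.0, 1.0]);
    store.grads.insert(2, vec![2.0]);
    store.grads.insert(3, vec![3.0]); // owned by some other optimizer

    let mut opt = ToySgd { lr: 0.1, weights: HashMap::new() };
    opt.weights.insert(1, vec![0.0, 0.0]);
    opt.weights.insert(2, vec![0.0]);

    opt.step(&mut store);
    println!("ids left in the store: {:?}", store.grads.keys().collect::<Vec<_>>());
}
```

Because the gradient is moved out of the store before it is used, the optimizer never holds a reference into the store while mutating its own weights, which is the property the trait design relies on.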
pub trait Optimizer {
    fn step(
        &mut self,
        grads: &mut GradientStore,
    ) -> Result<(), AutogradError>;
}
SGD
Stochastic Gradient Descent with optional momentum. The simplest optimizer, ideal for convex problems and as a baseline for experiments.
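For reference, the momentum update can be sketched on plain slices. This assumes the common formulation `v = momentum * v + g; p = p - lr * v`; the document does not state which momentum variant RUMUS implements, and `sgd_momentum_step` is a hypothetical helper, not a library function.

```rust
/// One SGD-with-momentum step over flat parameter and gradient slices.
/// Assumed rule: v = momentum * v + g, then p = p - lr * v.
fn sgd_momentum_step(
    params: &mut [f32],
    grads: &[f32],
    velocity: &mut [f32],
    lr: f32,
    momentum: f32,
) {
    assert_eq!(params.len(), grads.len());
    assert_eq!(params.len(), velocity.len());
    for ((p, g), v) in params.iter_mut().zip(grads).zip(velocity.iter_mut()) {
        *v = momentum * *v + *g;
        *p -= lr * *v;
    }
}

fn main() {
    let mut params = vec![1.0_f32, -2.0];
    let grads = vec![0.5_f32, -1.0];
    let mut velocity = vec![0.0_f32; 2];

    // Two steps with the same gradient: the second step is larger
    // because the velocity buffer has accumulated the first one.
    for step in 0..2 {
        sgd_momentum_step(&mut params, &grads, &mut velocity, 0.1, 0.9);
        println!("step {}: params = {:?}", step, params);
    }
}
```

With `momentum = 0.0` the velocity equals the gradient and the function reduces to plain SGD.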
use rumus::optim::SGD;
let params = model.parameters();
let mut optimizer = SGD::new(params, 0.01); // lr = 0.01
// With optional momentum
let mut optimizer = SGD::new(params, 0.01)
    .momentum(0.9);
Adam
Adaptive moment estimation. Maintains per-parameter first and second moment buffers for adaptive learning rates. The default choice for most deep learning tasks.
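The role of the two moment buffers is easiest to see in a scalar-loop sketch. The code below follows the textbook Adam update with bias correction; `AdamState`, `adam_step`, and the hyperparameter values in `main` are illustrative assumptions, and RUMUS's own defaults are not given in this document.

```rust
/// Per-parameter optimizer state: first moment, second moment, step count.
struct AdamState {
    m: Vec<f32>,
    v: Vec<f32>,
    t: i32,
}

/// One textbook Adam step with bias-corrected moments.
fn adam_step(
    params: &mut [f32],
    grads: &[f32],
    state: &mut AdamState,
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
) {
    state.t += 1;
    let bias1 = 1.0 - beta1.powi(state.t);
    let bias2 = 1.0 - beta2.powi(state.t);
    for i in 0..params.len() {
        let g = grads[i];
        state.m[i] = beta1 * state.m[i] + (1.0 - beta1) * g;
        state.v[i] = beta2 * state.v[i] + (1.0 - beta2) * g * g;
        let m_hat = state.m[i] / bias1;
        let v_hat = state.v[i] / bias2;
        params[i] -= lr * m_hat / (v_hat.sqrt() + eps);
    }
}

fn main() {
    let mut params = vec![1.0_f32, -2.0];
    // Gradients that differ by orders of magnitude.
    let grads = vec![0.5_f32, -100.0];
    let mut state = AdamState { m: vec![0.0; 2], v: vec![0.0; 2], t: 0 };

    adam_step(&mut params, &grads, &mut state, 0.001, 0.9, 0.999, 1e-8);
    println!("{:?}", params);
}
```

On the first step each parameter moves by roughly `lr` in the direction opposite its gradient, regardless of the gradient's magnitude. That normalisation is what "adaptive per-parameter learning rate" refers to.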
use rumus::optim::Adam;
let params = model.parameters();
let mut optimizer = Adam::new(params, 0.001); // lr = 0.001
// Adam maintains first and second moment buffers
// internally for each parameter, providing adaptive
// per-parameter learning rates.
AdamW
AdamW applies decoupled weight decay: the decay acts directly on the weights instead of being added to the gradient, where Adam's adaptive scaling would distort it. In RUMUS, AdamW runs GPU-fused kernels with zero host-device round-trips for the update step.
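The difference between the two treatments of weight decay can be written out as follows. This is the standard formulation from the literature, not a statement of RUMUS's exact kernel; the decay coefficient \lambda and the symbols \eta, \hat m_t, \hat v_t are notation introduced here, and whether RUMUS scales the decay term by the learning rate is not stated in this document.

```latex
\begin{align}
\text{Adam + L2:}\quad
  g_t &= \nabla f(\theta_{t-1}) + \lambda\,\theta_{t-1}, &
  \theta_t &= \theta_{t-1} - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} \\
\text{AdamW:}\quad
  g_t &= \nabla f(\theta_{t-1}), &
  \theta_t &= \theta_{t-1} - \eta\left(\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\,\theta_{t-1}\right)
\end{align}
```

In the first row the penalty term passes through the moment estimates and is divided by \sqrt{\hat v_t}, so weights with large gradients are decayed less. In the second row every weight is shrunk by the same relative amount.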
use rumus::optim::AdamW;
let params = model.parameters();
let mut optimizer = AdamW::new(params, 0.001); // lr = 0.001
// AdamW uses decoupled weight decay regularization
// and runs GPU-fused kernels: zero host-device
// round-trips for the optimizer step.
Borrow Safety
RUMUS optimizers are designed around Rust's ownership model. The step(&mut grads) method drains only the ParamIds registered to that optimizer from the gradient store. This means no overlapping mutable borrows can occur, and the compiler verifies safety at build time.
// Borrow safety: step(&mut grads) drains only
// the registered ParamIds from the gradient store.
// No overlapping mutable borrows occur.
let mut grads = loss.backward();
// Safe: optimizer only touches its own parameters
optimizer.step(&mut grads)?;
// grads still exists; other ParamIds are untouched
GPU-Fused Training
When parameters reside on the GPU, optimizer kernels run entirely as WGSL compute shaders. Weight updates, moment buffer updates, and weight decay are fused into a single GPU dispatch, so no data is copied back to the host.
// GPU-fused training: optimizer kernels execute
// entirely on the GPU, with no host-device round-trips.
let params = model.parameters();
let mut optimizer = AdamW::new(params, 0.001);
// When model is on GPU, optimizer.step() dispatches
// fused WGSL compute shaders that update weights,
// moments, and apply decay in a single GPU pass.
model.to_gpu();
let mut grads = loss.backward();
optimizer.step(&mut grads)?; // runs on GPU
LR Schedulers
Learning rate schedulers adjust the learning rate across epochs. All schedulers implement the LRScheduler trait. Schedulers do not own the optimizer — users read the current LR and apply it via optimizer.set_lr(scheduler.get_lr()).
pub trait LRScheduler {
    /// Advance the scheduler by one epoch
    fn step(&mut self);
    /// Return the current learning rate
    fn get_lr(&self) -> f32;
    /// Override the initial learning rate
    fn set_initial_lr(&mut self, lr: f32);
}
StepLR
Multiplies the learning rate by gamma every step_size epochs. Produces a staircase decay pattern.
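A staircase schedule of this kind can be sketched in a few lines. The trait below copies the `LRScheduler` signature from this page so the example compiles on its own; `ToyStepLR` is a hypothetical stand-in and makes no claim about how the library's `StepLR` is implemented internally.

```rust
// Local copy of the trait shape described on this page.
trait LRScheduler {
    fn step(&mut self);
    fn get_lr(&self) -> f32;
    fn set_initial_lr(&mut self, lr: f32);
}

/// Hypothetical staircase scheduler: lr = initial_lr * gamma^floor(epoch / step_size).
struct ToyStepLR {
    initial_lr: f32,
    step_size: u32,
    gamma: f32,
    epoch: u32,
}

impl ToyStepLR {
    fn new(initial_lr: f32, step_size: u32, gamma: f32) -> Self {
        assert!(step_size > 0, "step_size must be positive");
        ToyStepLR { initial_lr, step_size, gamma, epoch: 0 }
    }
}

impl LRScheduler for ToyStepLR {
    fn step(&mut self) {
        self.epoch += 1;
    }

    fn get_lr(&self) -> f32 {
        // Integer division gives the number of completed decay periods.
        let decays = (self.epoch / self.step_size) as i32;
        self.initial_lr * self.gamma.powi(decays)
    }

    fn set_initial_lr(&mut self, lr: f32) {
        self.initial_lr = lr;
    }
}

fn main() {
    let mut scheduler = ToyStepLR::new(0.1, 30, 0.1);
    for epoch in 0..90 {
        if epoch % 30 == 0 {
            println!("epoch {:2}: lr = {}", epoch, scheduler.get_lr());
        }
        scheduler.step();
    }
}
```

Computing the rate from the epoch counter, rather than multiplying a running value, avoids accumulating floating-point error across many decay steps.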
use rumus::optim::lr_scheduler::StepLR;
// Multiply LR by gamma every step_size epochs
// Formula: lr = initial_lr * gamma^floor(epoch / step_size)
let mut scheduler = StepLR::new(
    0.1, // initial_lr
    30,  // step_size: decay every 30 epochs
    0.1, // gamma: multiply by 0.1 each time
);
// epoch 0-29: lr = 0.1
// epoch 30-59: lr = 0.01
// epoch 60-89: lr = 0.001
CosineAnnealingLR
Smooth cosine decay from initial_lr down to eta_min over t_max epochs. Preferred for training runs where gradual warmdown improves final accuracy.
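The cosine schedule is a closed-form function of the epoch, so it can be checked in isolation. The helper below is a sketch with a hypothetical name, `cosine_annealing_lr`; it covers epochs from 0 to `t_max` only, since the document does not say how the library behaves past `t_max`.

```rust
use std::f32::consts::PI;

/// lr = eta_min + 0.5 * (initial_lr - eta_min) * (1 + cos(pi * epoch / t_max))
fn cosine_annealing_lr(initial_lr: f32, eta_min: f32, epoch: u32, t_max: u32) -> f32 {
    assert!(t_max > 0, "t_max must be positive");
    let progress = epoch as f32 / t_max as f32;
    eta_min + 0.5 * (initial_lr - eta_min) * (1.0 + (PI * progress).cos())
}

fn main() {
    // Start, midpoint, and end of a 100-epoch schedule.
    for epoch in [0, 25, 50, 75, 100] {
        let lr = cosine_annealing_lr(0.1, 1e-6, epoch, 100);
        println!("epoch {:3}: lr = {:.6}", epoch, lr);
    }
}
```

The rate starts at `initial_lr`, passes through the midpoint between `initial_lr` and `eta_min` at `t_max / 2`, and reaches `eta_min` at `t_max`.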
use rumus::optim::lr_scheduler::CosineAnnealingLR;
// Smooth cosine decay over t_max epochs
// Formula: lr = eta_min + 0.5 * (initial_lr - eta_min)
//               * (1 + cos(pi * epoch / t_max))
let mut scheduler = CosineAnnealingLR::new(
    0.1,  // initial_lr
    100,  // t_max: full cosine period
    1e-6, // eta_min: minimum learning rate
);
Applying a Scheduler
Because schedulers are decoupled from optimizers, you control exactly when and how the learning rate is updated.
// Schedulers don't own the optimizer; users
// apply the LR manually each epoch.
for epoch in 0..num_epochs {
    optimizer.set_lr(scheduler.get_lr());
    for batch in &dataloader {
        // ... training step ...
    }
    scheduler.step(); // advance after each epoch
}
Gradient Clipping
The clip_grad_norm_ function clips the global L2 norm of all gradients in-place and returns the original unclipped norm. This prevents exploding gradients during training.
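The arithmetic behind global-norm clipping is short enough to show on the CPU. The sketch below represents each gradient tensor as a `Vec<f32>` and uses a hypothetical function name, `clip_grad_norm`; it illustrates the computation only and does not use RUMUS's `GradientStore` or `ParamId` types.

```rust
/// Clip the global L2 norm of a set of gradient buffers in place.
/// Returns the norm measured before clipping.
fn clip_grad_norm(grads: &mut [Vec<f32>], max_norm: f32) -> f32 {
    // Step 1: one sum of squares per gradient buffer.
    let sums: Vec<f32> = grads
        .iter()
        .map(|g| g.iter().map(|x| x * x).sum::<f32>())
        .collect();

    // Step 2: combine the partial sums into a single global norm.
    let total_norm = sums.iter().sum::<f32>().sqrt();

    // Step 3: rescale every buffer, but only if the norm exceeds the limit.
    if total_norm > max_norm {
        let scale = max_norm / total_norm;
        for g in grads.iter_mut() {
            for x in g.iter_mut() {
                *x *= scale;
            }
        }
    }
    total_norm
}

fn main() {
    let mut grads = vec![vec![3.0_f32], vec![4.0_f32]];
    let norm = clip_grad_norm(&mut grads, 1.0);
    println!("norm before clipping: {}, grads after: {:?}", norm, grads);
}
```

All buffers share one scale factor, so the direction of the combined gradient is preserved and only its length changes.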
use rumus::optim::clip_grad_norm_;
/// Clips the global L2 norm of all gradients.
/// Returns the unclipped total norm.
pub fn clip_grad_norm_(
    grads: &mut GradientStore,
    params: &[ParamId],
    max_norm: f32,
) -> f32;
GPU Strategy
On GPU tensors, gradient clipping uses a 3-pass strategy to minimize host-device synchronization: a parallel sum-of-squares reduction, a single CPU readback to compute the total norm, and a conditional scale dispatch only when clipping is needed.
// 3-pass GPU strategy for clip_grad_norm_:
//
// Pass 1: dispatch reduce_sum_sq kernel for each
//         GPU gradient tensor in parallel
//
// Pass 2: read back per-tensor sums of squares to CPU,
//         compute total_norm = sqrt(sum of all)
//
// Pass 3: if total_norm > max_norm, dispatch scale
//         kernel on each gradient:
//         grad *= max_norm / total_norm
Usage
Call clip_grad_norm_ after backward() and before optimizer.step().
let mut grads = loss.backward();
// Clip gradients before the optimizer step
let total_norm = clip_grad_norm_(
    &mut grads,
    &model.parameters(),
    1.0, // max_norm
);
println!("Gradient norm: {:.4}", total_norm);
optimizer.step(&mut grads)?;
DataLoader
The DataLoader batches and shuffles data from any type implementing the Dataset trait. It supports multithreaded prefetching with bounded mpsc channels for backpressure and graceful Drop teardown of worker threads.
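The combination of a bounded channel and a `Drop`-based shutdown can be sketched with the standard library alone. In the example below, `Prefetcher` is a hypothetical type with a single worker that sends batches of indices rather than tensors; it shows the mechanism and is not the DataLoader's actual implementation.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread::{self, JoinHandle};

/// Hypothetical single-worker prefetcher that yields batches of indices.
struct Prefetcher {
    rx: Option<Receiver<Vec<usize>>>,
    worker: Option<JoinHandle<()>>,
}

impl Prefetcher {
    /// `batch_size` must be non-zero; `capacity` bounds the queued batches.
    fn new(num_items: usize, batch_size: usize, capacity: usize) -> Self {
        let (tx, rx) = sync_channel::<Vec<usize>>(capacity);
        let worker = thread::spawn(move || {
            let indices: Vec<usize> = (0..num_items).collect();
            for chunk in indices.chunks(batch_size) {
                // `send` blocks while the channel is full (backpressure) and
                // returns Err once the receiving side has been dropped.
                if tx.send(chunk.to_vec()).is_err() {
                    break;
                }
            }
        });
        Prefetcher { rx: Some(rx), worker: Some(worker) }
    }

    /// Returns the next batch, or None once the worker has finished.
    fn next_batch(&mut self) -> Option<Vec<usize>> {
        self.rx.as_ref()?.recv().ok()
    }
}

impl Drop for Prefetcher {
    fn drop(&mut self) {
        // Drop the receiver first so a worker blocked in `send` wakes up
        // with an error, then wait for the thread to exit.
        self.rx.take();
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

fn main() {
    let mut prefetcher = Prefetcher::new(10, 4, 2);
    while let Some(batch) = prefetcher.next_batch() {
        println!("{:?}", batch);
    }
}
```

The order inside `drop` matters: joining the worker before dropping the receiver could deadlock if the worker is blocked on a full channel.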
Dataset Trait
Datasets must be Send + Sync so worker threads can access them concurrently. Each call to get(index) returns a DataItem containing an input tensor and a target tensor.
/// Must be Send + Sync for multi-threaded loading
pub trait Dataset: Send + Sync {
    fn len(&self) -> usize;
    fn get(&self, index: usize) -> DataItem;
}
pub struct DataItem {
    pub input: Tensor,
    pub target: Tensor,
}
Creating a DataLoader
Shuffle uses Fisher-Yates per epoch. The iter() method returns a one-epoch iterator that yields batched DataItems with stacked tensors along the batch dimension.
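A per-epoch Fisher-Yates shuffle over sample indices looks roughly like the sketch below. To stay dependency-free it uses a small linear congruential generator, `Lcg`, which is a hypothetical helper chosen for brevity; the document does not say which random number generator RUMUS uses, and the modulo reduction here carries a slight bias that a production shuffle would avoid.

```rust
/// Tiny linear congruential generator, for illustration only.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // The high bits of an LCG are the better-distributed ones.
        self.0 >> 33
    }

    /// Pseudo-random value in 0..n (n must be non-zero; slightly biased).
    fn below(&mut self, n: usize) -> usize {
        (self.next() % n as u64) as usize
    }
}

/// Return the indices 0..len in a shuffled order (Fisher-Yates).
fn shuffled_indices(len: usize, seed: u64) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..len).collect();
    let mut rng = Lcg(seed);
    // Walk from the back, swapping each slot with a random earlier-or-equal slot.
    for i in (1..len).rev() {
        let j = rng.below(i + 1);
        indices.swap(i, j);
    }
    indices
}

fn main() {
    // A different seed per epoch gives a different visiting order.
    for epoch in 0..2u64 {
        println!("epoch {}: {:?}", epoch, shuffled_indices(8, epoch + 1));
    }
}
```

Shuffling indices rather than samples keeps the dataset itself untouched, so batches can be formed by slicing the index list and calling `get` for each entry.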
use rumus::data::{DataLoader, Dataset};
let loader = DataLoader::new(
    dataset, // impl Dataset
    32,      // batch_size
    true,    // shuffle: Fisher-Yates per epoch
    false,   // drop_last: drop incomplete final batch
    4,       // num_workers: prefetch threads
    2,       // prefetch_factor: batches per worker
);
// Iterate one epoch
for batch in loader.iter() {
    // batch.input: [batch_size, ...]
    // batch.target: [batch_size, ...]
}
Single-Thread Debugging
Set num_workers=0 to run all loading on the calling thread. This makes it straightforward to step through data loading with a debugger.
// Use num_workers=0 for single-threaded debugging.
// All batching happens on the calling thread, which is
// easier to step through with a debugger.
let loader = DataLoader::new(
    dataset, 32, true, false,
    0, // single-thread mode
    1,
);
RecordWriter / RecordDataset
For massive datasets, RUMUS provides a binary .rrec format with O(1) random access via a trailing index. RecordWriter::new(path) creates a new file, .append(input, target) writes samples, and .finish() flushes the index footer. RecordDataset::open(path) implements the Dataset trait using memmap2 for lock-free concurrent reads, so worker threads read samples directly from the memory-mapped file.
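The idea of a trailing index can be shown with an in-memory toy format. The byte layout below (records, then a table of little-endian `u64` offsets, then a record count) is an assumption made purely for illustration and is not the actual .rrec layout; `ToyRecordWriter` and `ToyRecordReader` are hypothetical names, and the reader performs no validation of malformed input.

```rust
/// Toy layout: [records...][offset table: (count + 1) x u64 LE][count: u64 LE]
struct ToyRecordWriter {
    buf: Vec<u8>,
    offsets: Vec<u64>,
}

impl ToyRecordWriter {
    fn new() -> Self {
        ToyRecordWriter { buf: Vec::new(), offsets: Vec::new() }
    }

    fn append(&mut self, record: &[u8]) {
        self.offsets.push(self.buf.len() as u64);
        self.buf.extend_from_slice(record);
    }

    /// Write the index footer and return the finished bytes.
    fn finish(mut self) -> Vec<u8> {
        let data_end = self.buf.len() as u64;
        let count = self.offsets.len() as u64;
        for off in self.offsets.iter().chain(std::iter::once(&data_end)) {
            self.buf.extend_from_slice(&off.to_le_bytes());
        }
        self.buf.extend_from_slice(&count.to_le_bytes());
        self.buf
    }
}

fn read_u64(bytes: &[u8], at: usize) -> u64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&bytes[at..at + 8]);
    u64::from_le_bytes(raw)
}

struct ToyRecordReader<'a> {
    bytes: &'a [u8],
    count: usize,
    index_start: usize,
}

impl<'a> ToyRecordReader<'a> {
    fn open(bytes: &'a [u8]) -> Self {
        // The record count sits in the last 8 bytes; the table precedes it.
        let count = read_u64(bytes, bytes.len() - 8) as usize;
        let index_start = bytes.len() - 8 - (count + 1) * 8;
        ToyRecordReader { bytes, count, index_start }
    }

    /// O(1) lookup: two offset reads, then a slice of the data region.
    fn get(&self, i: usize) -> &'a [u8] {
        assert!(i < self.count, "record index out of range");
        let start = read_u64(self.bytes, self.index_start + i * 8) as usize;
        let end = read_u64(self.bytes, self.index_start + (i + 1) * 8) as usize;
        &self.bytes[start..end]
    }
}

fn main() {
    let mut writer = ToyRecordWriter::new();
    writer.append(b"first");
    writer.append(b"second record");
    let bytes = writer.finish();

    let reader = ToyRecordReader::open(&bytes);
    println!("{}", String::from_utf8_lossy(reader.get(1)));
}
```

Placing the index at the end lets the writer stream records without knowing their count in advance, and the reader works on any `&[u8]`, which is the same shape of data a memory-mapped file exposes.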
use rumus::data::{RecordWriter, RecordDataset, DataLoader};
use std::sync::Arc;
let mut writer = RecordWriter::new("train.rrec")?;
writer.append(&input, &target)?;
writer.finish()?;
let dataset = Arc::new(RecordDataset::open("train.rrec")?);
let loader = DataLoader::new(dataset, 32, true, false, 4, 2);
Full Training Loop
The Trainer struct provides a high-level API that wraps the forward pass, backward pass, and optimizer step into a single call. Combined with a DataLoader, a CosineAnnealingLR scheduler, and clip_grad_norm_, it yields a complete training loop.
use rumus::optim::{Trainer, AdamW, clip_grad_norm_};
use rumus::optim::lr_scheduler::CosineAnnealingLR;
use rumus::data::DataLoader;
use rumus::nn;
// Model & optimizer
let optimizer = AdamW::new(model.parameters(), 3e-4);
let mut trainer = Trainer::new(optimizer);
// LR scheduler: cosine decay over 100 epochs
let mut scheduler = CosineAnnealingLR::new(3e-4, 100, 1e-6);
// DataLoader: 4 workers, prefetch 2 batches each
let loader = DataLoader::new(dataset, 64, true, true, 4, 2);
for epoch in 0..100 {
    // Read the LR once so the log below reports the value used this epoch
    let lr = scheduler.get_lr();
    trainer.optimizer_mut().set_lr(lr);
    for batch in loader.iter() {
        trainer.train_step_with(|grads, params| {
            // Clip gradients before optimizer step
            clip_grad_norm_(grads, params, 1.0);
        }, || {
            let logits = model.forward(&batch.input);
            nn::cross_entropy_loss(&logits, &batch.target)
        });
    }
    scheduler.step();
    println!(
        "Epoch {:3} | lr {:.6} | loss {:.4}",
        epoch,
        lr,
        trainer.epoch_avg_loss(),
    );
}