Optimizers

First-class optimizer implementations with borrow-safe gradient consumption, GPU-fused kernels, LR scheduling, gradient clipping, and a high-level Trainer API.

Optimizer Trait

All optimizers implement a single trait. The step method takes a mutable reference to the gradient store, draining only the gradients for its registered parameters. This design eliminates overlapping borrows at compile time.

rust
pub trait Optimizer {
    fn step(
        &mut self,
        grads: &mut GradientStore,
    ) -> Result<(), AutogradError>;
}

SGD

Stochastic Gradient Descent with optional momentum. The simplest optimizer, ideal for convex problems and as a baseline for experiments.

rust
use rumus::optim::SGD;

let params = model.parameters();
let mut optimizer = SGD::new(params, 0.01);  // lr = 0.01

// Alternatively, with momentum
let mut optimizer = SGD::new(model.parameters(), 0.01)
    .momentum(0.9);
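
As a rough picture of what the momentum option does, the sketch below applies the classical update (v = momentum * v + g, then w = w - lr * v) to plain slices. The function `sgd_momentum_step` and its arguments are hypothetical and not part of the RUMUS API; the exact variant RUMUS implements (dampening, Nesterov, and so on) is not stated here.

```rust
/// One SGD step with classical momentum on a flat parameter slice.
/// Hypothetical helper for illustration; not a RUMUS function.
fn sgd_momentum_step(
    params: &mut [f32],
    grads: &[f32],
    velocity: &mut [f32],
    lr: f32,
    momentum: f32,
) {
    assert_eq!(params.len(), grads.len());
    assert_eq!(params.len(), velocity.len());
    for i in 0..params.len() {
        // v <- momentum * v + g
        velocity[i] = momentum * velocity[i] + grads[i];
        // w <- w - lr * v
        params[i] -= lr * velocity[i];
    }
}
```

With momentum set to 0.0 the velocity equals the gradient and the step reduces to plain SGD.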

Adam

Adaptive moment estimation. Maintains per-parameter first and second moment buffers for adaptive learning rates. The default choice for most deep learning tasks.
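
The moment buffers can be illustrated with the standard Adam update with bias correction. This is a standalone sketch on plain slices; `AdamState` and `adam_step` are hypothetical names, and the beta and epsilon values are passed explicitly because the defaults RUMUS uses are not given here.

```rust
/// Per-parameter Adam state for a flat slice of weights.
/// Hypothetical stand-in; RUMUS keeps its own internal buffers.
struct AdamState {
    m: Vec<f32>, // first moment: running mean of gradients
    v: Vec<f32>, // second moment: running mean of squared gradients
    t: i32,      // number of steps taken so far
}

impl AdamState {
    fn new(len: usize) -> Self {
        AdamState { m: vec![0.0; len], v: vec![0.0; len], t: 0 }
    }
}

fn adam_step(
    params: &mut [f32],
    grads: &[f32],
    state: &mut AdamState,
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
) {
    state.t += 1;
    // Bias correction compensates for the buffers starting at zero.
    let bc1 = 1.0 - beta1.powi(state.t);
    let bc2 = 1.0 - beta2.powi(state.t);
    for i in 0..params.len() {
        let g = grads[i];
        state.m[i] = beta1 * state.m[i] + (1.0 - beta1) * g;
        state.v[i] = beta2 * state.v[i] + (1.0 - beta2) * g * g;
        let m_hat = state.m[i] / bc1;
        let v_hat = state.v[i] / bc2;
        // Dividing by sqrt(v_hat) gives each parameter its own step size.
        params[i] -= lr * m_hat / (v_hat.sqrt() + eps);
    }
}
```

On the first step the corrected ratio is close to the sign of the gradient, so each parameter moves by roughly lr regardless of gradient magnitude.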

rust
use rumus::optim::Adam;

let params = model.parameters();
let mut optimizer = Adam::new(params, 0.001);  // lr = 0.001

// Adam maintains first and second moment buffers
// internally for each parameter, providing adaptive
// per-parameter learning rates.

AdamW

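AdamW applies weight decay directly to the weights instead of adding an L2 term to the gradient. In plain Adam, an L2 term is rescaled by the adaptive per-parameter step size, so the effective decay differs from parameter to parameter; decoupling the decay avoids this. In RUMUS, AdamW runs GPU-fused kernels with zero host-device round-trips for the update step.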
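
To make "decoupled" concrete, here is a minimal single-weight sketch. The names `Moments` and `adamw_step` and the hard-coded beta and epsilon constants are assumptions for illustration; this is a CPU reference of the arithmetic, not the fused RUMUS kernel.

```rust
/// Moment buffers for a single scalar weight (hypothetical, for illustration).
#[derive(Default)]
struct Moments {
    m: f32,
    v: f32,
    t: i32,
}

/// AdamW-style step on one weight: the decay acts on the weight directly
/// and never enters the gradient or the moment buffers.
fn adamw_step(w: &mut f32, g: f32, s: &mut Moments, lr: f32, weight_decay: f32) {
    const BETA1: f32 = 0.9;
    const BETA2: f32 = 0.999;
    const EPS: f32 = 1e-8;

    // Decoupled weight decay: shrink the weight independently of the gradient.
    *w -= lr * weight_decay * *w;

    // Standard Adam update using the raw gradient.
    s.t += 1;
    s.m = BETA1 * s.m + (1.0 - BETA1) * g;
    s.v = BETA2 * s.v + (1.0 - BETA2) * g * g;
    let m_hat = s.m / (1.0 - BETA1.powi(s.t));
    let v_hat = s.v / (1.0 - BETA2.powi(s.t));
    *w -= lr * m_hat / (v_hat.sqrt() + EPS);
}
```

With a zero gradient the weight shrinks by exactly lr * weight_decay * w. If the decay were instead added to the gradient, it would pass through Adam's normalization and the size of the shrink would no longer track weight_decay.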

rust
use rumus::optim::AdamW;

let params = model.parameters();
let mut optimizer = AdamW::new(params, 0.001);  // lr = 0.001

// AdamW uses decoupled weight decay regularization
// and runs GPU-fused kernels, with zero host-device
// round-trips for the optimizer step.

Borrow Safety

RUMUS optimizers are designed around Rust's ownership model. The step(&mut grads) method drains only the ParamIds registered to that optimizer from the gradient store. This means no overlapping mutable borrows can occur, and the compiler verifies safety at build time.
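
The draining behavior can be sketched with stand-in types. Everything below (`GradientStore` as a HashMap wrapper, its `take` method, `ToySgd`) is hypothetical and only mirrors the idea described above; the real RUMUS types are not shown in this document.

```rust
use std::collections::HashMap;

type ParamId = usize;

/// Stand-in gradient store: maps a parameter id to its gradient values.
struct GradientStore {
    grads: HashMap<ParamId, Vec<f32>>,
}

impl GradientStore {
    /// Remove and return the gradient for `id`, if present.
    fn take(&mut self, id: ParamId) -> Option<Vec<f32>> {
        self.grads.remove(&id)
    }
}

/// Stand-in optimizer that owns its weights and knows which ids belong to it.
struct ToySgd {
    lr: f32,
    weights: HashMap<ParamId, Vec<f32>>,
}

impl ToySgd {
    fn step(&mut self, grads: &mut GradientStore) {
        let lr = self.lr;
        for (id, w) in self.weights.iter_mut() {
            // Drain only this optimizer's ids; everything else stays in the store.
            if let Some(g) = grads.take(*id) {
                for (wi, gi) in w.iter_mut().zip(g.iter()) {
                    *wi -= lr * gi;
                }
            }
        }
    }
}
```

Because each gradient is moved out of the store rather than borrowed from it, a second optimizer can take the same `&mut GradientStore` afterwards and find only the entries that are still unclaimed.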

rust
// Borrow safety: step(&mut grads) drains only
// the registered ParamIds from the gradient store.
// No overlapping mutable borrows occur.

let mut grads = loss.backward();

// Safe: optimizer only touches its own parameters
optimizer.step(&mut grads)?;

// grads still exists; other ParamIds are untouched

GPU-Fused Training

When parameters reside on the GPU, optimizer kernels run entirely as WGSL compute shaders. Weight updates, moment buffer updates, and weight decay are fused into a single GPU dispatch — no data is copied back to the host.

rust
// GPU-fused training: optimizer kernels execute
// entirely on the GPU, with no host-device round-trips.

let params = model.parameters();
let mut optimizer = AdamW::new(params, 0.001);

// When model is on GPU, optimizer.step() dispatches
// fused WGSL compute shaders that update weights,
// moments, and apply decay in a single GPU pass.
model.to_gpu();

let mut grads = loss.backward();
optimizer.step(&mut grads)?;  // runs on GPU

LR Schedulers

Learning rate schedulers adjust the learning rate across epochs. All schedulers implement the LRScheduler trait. Schedulers do not own the optimizer — users read the current LR and apply it via optimizer.set_lr(scheduler.get_lr()).

rust
pub trait LRScheduler {
    /// Advance the scheduler by one epoch
    fn step(&mut self);

    /// Return the current learning rate
    fn get_lr(&self) -> f32;

    /// Override the initial learning rate
    fn set_initial_lr(&mut self, lr: f32);
}

StepLR

Multiplies the learning rate by gamma every step_size epochs. Produces a staircase decay pattern.
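
A minimal sketch of the staircase rule, assuming the rate is initial_lr * gamma raised to the number of completed step_size intervals. `StepDecay` is a hypothetical stand-in written for this example, not the RUMUS type.

```rust
/// Hypothetical stand-in for a staircase scheduler; not the RUMUS type.
struct StepDecay {
    initial_lr: f32,
    step_size: u32,
    gamma: f32,
    epoch: u32,
}

impl StepDecay {
    fn new(initial_lr: f32, step_size: u32, gamma: f32) -> Self {
        assert!(step_size > 0, "step_size must be positive");
        StepDecay { initial_lr, step_size, gamma, epoch: 0 }
    }

    /// Advance by one epoch.
    fn step(&mut self) {
        self.epoch += 1;
    }

    /// lr = initial_lr * gamma^floor(epoch / step_size)
    fn get_lr(&self) -> f32 {
        // Integer division floors, which is what produces the staircase.
        let decays = (self.epoch / self.step_size) as i32;
        self.initial_lr * self.gamma.powi(decays)
    }
}
```

The rate stays flat inside each interval and drops by a factor of gamma only when epoch crosses a multiple of step_size.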

rust
use rumus::optim::lr_scheduler::StepLR;

// Multiply LR by gamma every step_size epochs
// Formula: lr = initial_lr * gamma^floor(epoch / step_size)
let mut scheduler = StepLR::new(
    0.1,   // initial_lr
    30,    // step_size: decay every 30 epochs
    0.1,   // gamma: multiply by 0.1 each time
);

// epoch 0-29:  lr = 0.1
// epoch 30-59: lr = 0.01
// epoch 60-89: lr = 0.001

CosineAnnealingLR

Smooth cosine decay from initial_lr down to eta_min over t_max epochs. Preferred for training runs where gradual warmdown improves final accuracy.

rust
use rumus::optim::lr_scheduler::CosineAnnealingLR;

// Smooth cosine decay over t_max epochs
// Formula: lr = eta_min + 0.5 * (initial_lr - eta_min)
//              * (1 + cos(pi * epoch / t_max))
let mut scheduler = CosineAnnealingLR::new(
    0.1,    // initial_lr
    100,    // t_max: full cosine period
    1e-6,   // eta_min: minimum learning rate
);
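
The cosine formula can be checked with a small free function. `cosine_lr` is a hypothetical helper, and clamping the epoch at t_max is an assumption made for this sketch; how RUMUS behaves past t_max is not specified here.

```rust
use std::f32::consts::PI;

/// lr = eta_min + 0.5 * (initial_lr - eta_min) * (1 + cos(pi * epoch / t_max))
/// Hypothetical free function for illustration; not the RUMUS implementation.
fn cosine_lr(initial_lr: f32, eta_min: f32, t_max: u32, epoch: u32) -> f32 {
    assert!(t_max > 0, "t_max must be positive");
    // Assumption: hold the rate at eta_min once the schedule has finished.
    let e = epoch.min(t_max) as f32;
    let cos_term = (PI * e / t_max as f32).cos();
    eta_min + 0.5 * (initial_lr - eta_min) * (1.0 + cos_term)
}
```

At epoch 0 the cosine term is 1 and the function returns initial_lr; at epoch t_max it is -1 and the function returns eta_min; halfway through it returns the midpoint of the two.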

Applying a Scheduler

Because schedulers are decoupled from optimizers, you control exactly when and how the learning rate is updated.

rust
// Schedulers don't own the optimizer; users
// apply the LR manually each epoch.

for epoch in 0..num_epochs {
    optimizer.set_lr(scheduler.get_lr());

    for batch in &dataloader {
        // ... training step ...
    }

    scheduler.step();  // advance after each epoch
}

Gradient Clipping

The clip_grad_norm_ function clips the global L2 norm of all gradients in-place and returns the original unclipped norm. This prevents exploding gradients during training.

rust
use rumus::optim::clip_grad_norm_;

/// Clips the global L2 norm of all gradients.
/// Returns the unclipped total norm.
pub fn clip_grad_norm_(
    grads: &mut GradientStore,
    params: &[ParamId],
    max_norm: f32,
) -> f32;

GPU Strategy

On GPU tensors, gradient clipping uses a 3-pass strategy to minimize host-device synchronization: parallel norm reduction, a single CPU readback to compute the total norm, and a conditional scale dispatch only when clipping is needed.
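
A CPU reference of the same three passes may help when reasoning about the result. The function below is a hypothetical sketch that treats each gradient tensor as a flat Vec<f32>; it is not the RUMUS implementation and performs no GPU work.

```rust
/// CPU reference of the three passes on a list of gradient tensors,
/// each represented as a flat Vec<f32>. Returns the unclipped norm.
fn clip_grad_norm_reference(grads: &mut [Vec<f32>], max_norm: f32) -> f32 {
    // Pass 1: one sum of squares per tensor (the GPU would run these in parallel).
    let sums_sq: Vec<f32> = grads
        .iter()
        .map(|g| g.iter().map(|x| x * x).sum::<f32>())
        .collect();

    // Pass 2: combine the partial sums on the host and take a single sqrt.
    let total_norm = sums_sq.iter().sum::<f32>().sqrt();

    // Pass 3: rescale only if the norm exceeds the threshold.
    if total_norm > max_norm {
        let scale = max_norm / total_norm;
        for g in grads.iter_mut() {
            for x in g.iter_mut() {
                *x *= scale;
            }
        }
    }
    total_norm
}
```

Note that the per-tensor quantities combined in pass 2 are sums of squares, not norms: for two tensors holding 3.0 and 4.0, the global norm is sqrt(9 + 16) = 5, and clipping to 1.0 scales them to 0.6 and 0.8.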

rust
// 3-pass GPU strategy for clip_grad_norm_:
//
// Pass 1: dispatch reduce_sum_sq kernel for each
//         GPU gradient tensor in parallel
//
// Pass 2: read back the per-tensor sums of squares
//         to the CPU and compute
//         total_norm = sqrt(sum of all)
//
// Pass 3: if total_norm > max_norm, dispatch scale
//         kernel on each gradient:
//         grad *= max_norm / total_norm

Usage

Call clip_grad_norm_ after backward() and before optimizer.step().

rust
let mut grads = loss.backward();

// Clip gradients before the optimizer step
let total_norm = clip_grad_norm_(
    &mut grads,
    &model.parameters(),
    1.0,  // max_norm
);
println!("Gradient norm: {:.4}", total_norm);

optimizer.step(&mut grads)?;

DataLoader

The DataLoader batches and shuffles data from any type implementing the Dataset trait. It supports multithreaded prefetching with bounded mpsc channels for backpressure and graceful Drop teardown of worker threads.
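
The backpressure and teardown behavior can be sketched with the standard library alone. `Prefetcher` below is a hypothetical single-worker stand-in that produces batches of indices rather than tensors; it shows the bounded-channel pattern, not the RUMUS internals.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread::{self, JoinHandle};

/// Hypothetical prefetcher: one worker thread fills a bounded queue of batches.
struct Prefetcher {
    rx: Option<Receiver<Vec<usize>>>,
    worker: Option<JoinHandle<()>>,
}

impl Prefetcher {
    fn new(num_items: usize, batch_size: usize, capacity: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        // sync_channel blocks the sender once `capacity` batches are waiting,
        // which is the backpressure: the worker cannot run far ahead.
        let (tx, rx) = sync_channel::<Vec<usize>>(capacity);
        let worker = thread::spawn(move || {
            let indices: Vec<usize> = (0..num_items).collect();
            for chunk in indices.chunks(batch_size) {
                // send fails once the receiver is dropped; stop producing then.
                if tx.send(chunk.to_vec()).is_err() {
                    break;
                }
            }
        });
        Prefetcher { rx: Some(rx), worker: Some(worker) }
    }

    /// Returns None once the worker has sent every batch and exited.
    fn next_batch(&mut self) -> Option<Vec<usize>> {
        self.rx.as_ref()?.recv().ok()
    }
}

impl Drop for Prefetcher {
    fn drop(&mut self) {
        // Drop the receiver first so a blocked send returns an error,
        // then join the worker so no thread outlives the loader.
        self.rx.take();
        if let Some(handle) = self.worker.take() {
            let _ = handle.join();
        }
    }
}
```

The order inside drop matters: joining before releasing the receiver could deadlock, because the worker may be parked in send waiting for queue space.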

Dataset Trait

Datasets must be Send + Sync so worker threads can access them concurrently. Each call to get(index) returns a DataItem containing an input tensor and a target tensor.

rust
/// Must be Send + Sync for multi-threaded loading
pub trait Dataset: Send + Sync {
    fn len(&self) -> usize;
    fn get(&self, index: usize) -> DataItem;
}

pub struct DataItem {
    pub input: Tensor,
    pub target: Tensor,
}
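
As an illustration of implementing the trait, here is a small in-memory dataset. To keep the sketch compilable on its own, `Tensor`, `DataItem`, and `Dataset` are redefined as local stand-ins that mirror the definitions above (the stand-in `Tensor` is just a wrapper around Vec<f32>); `InMemoryDataset` is a hypothetical type.

```rust
/// Minimal stand-ins so the sketch compiles on its own.
/// The real Tensor, DataItem, and Dataset come from rumus.
#[derive(Clone, Debug, PartialEq)]
pub struct Tensor(pub Vec<f32>);

pub struct DataItem {
    pub input: Tensor,
    pub target: Tensor,
}

pub trait Dataset: Send + Sync {
    fn len(&self) -> usize;
    fn get(&self, index: usize) -> DataItem;
}

/// In-memory dataset of (feature row, label) pairs.
pub struct InMemoryDataset {
    rows: Vec<Vec<f32>>,
    labels: Vec<f32>,
}

impl InMemoryDataset {
    pub fn new(rows: Vec<Vec<f32>>, labels: Vec<f32>) -> Self {
        assert_eq!(rows.len(), labels.len(), "one label per row");
        InMemoryDataset { rows, labels }
    }
}

impl Dataset for InMemoryDataset {
    fn len(&self) -> usize {
        self.rows.len()
    }

    // Takes &self only, so several worker threads can call it at once.
    fn get(&self, index: usize) -> DataItem {
        DataItem {
            input: Tensor(self.rows[index].clone()),
            target: Tensor(vec![self.labels[index]]),
        }
    }
}
```

Because the struct holds only owned Vec data, it is Send + Sync automatically; a dataset with interior mutability (for example a cache) would need a thread-safe wrapper to satisfy the bound.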

Creating a DataLoader

Shuffle uses Fisher-Yates per epoch. The iter() method returns a one-epoch iterator that yields batched DataItems with stacked tensors along the batch dimension.

rust
use rumus::data::{DataLoader, Dataset};

let loader = DataLoader::new(
    dataset,          // impl Dataset
    32,               // batch_size
    true,             // shuffle: Fisher-Yates per epoch
    false,            // drop_last: drop incomplete final batch
    4,                // num_workers: prefetch threads
    2,                // prefetch_factor: batches per worker
);

// Iterate one epoch
for batch in loader.iter() {
    // batch.input:  [batch_size, ...]
    // batch.target: [batch_size, ...]
}
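
The per-epoch index plan (shuffle, then split into batches, optionally dropping a short final batch) can be sketched as follows. `XorShift64`, `fisher_yates`, and `epoch_batches` are hypothetical helpers, and the toy generator is there only to avoid external crates; which random source RUMUS uses is not stated here.

```rust
/// Tiny deterministic generator (xorshift64) so the sketch needs no crates.
/// Illustration only; a real loader would use a proper RNG.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Fisher-Yates: walk from the back, swapping each slot with a random
/// slot at or before it.
fn fisher_yates(indices: &mut [usize], rng: &mut XorShift64) {
    for i in (1..indices.len()).rev() {
        // Modulo introduces a slight bias; acceptable for a sketch.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        indices.swap(i, j);
    }
}

/// Build the index batches for one epoch.
fn epoch_batches(
    len: usize,
    batch_size: usize,
    shuffle: bool,
    drop_last: bool,
    seed: u64,
) -> Vec<Vec<usize>> {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut indices: Vec<usize> = (0..len).collect();
    if shuffle {
        // xorshift must not start from zero.
        let mut rng = XorShift64(seed.max(1));
        fisher_yates(&mut indices, &mut rng);
    }
    indices
        .chunks(batch_size)
        .filter(|c| !drop_last || c.len() == batch_size)
        .map(|c| c.to_vec())
        .collect()
}
```

With 10 samples and a batch size of 4, this yields three batches (4, 4, 2) when drop_last is false and two when it is true; shuffling changes the order of indices but every index still appears exactly once per epoch.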

Single-Thread Debugging

Set num_workers=0 to run all loading on the calling thread. This makes it straightforward to step through data loading with a debugger.

rust
// Use num_workers=0 for single-threaded debugging.
// All batching happens on the calling thread, which
// is easier to step through with a debugger.

let loader = DataLoader::new(
    dataset, 32, true, false,
    0,   // single-thread mode
    1,
);

RecordWriter / RecordDataset

For massive datasets, RUMUS provides a binary .rrec format with O(1) random access via a trailing index. RecordWriter::new(path) creates a new file, .append(input, target) writes a sample, and .finish() writes the index footer. RecordDataset::open(path) implements the Dataset trait on top of memmap2, so worker threads read samples concurrently without locks and without issuing a read call per sample.

rust
use rumus::data::{RecordWriter, RecordDataset, DataLoader};
use std::sync::Arc;

let mut writer = RecordWriter::new("train.rrec")?;
writer.append(&input, &target)?;
writer.finish()?;

let dataset = Arc::new(RecordDataset::open("train.rrec")?);
let loader = DataLoader::new(dataset, 32, true, false, 4, 2);
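
To show why a trailing index gives O(1) random access, the sketch below writes and reads a toy record file. The byte layout is invented for this example and is not the actual .rrec format; it also uses ordinary seeks on any Read + Seek source instead of memmap2, and `write_records`, `read_u64`, and `read_record` are hypothetical helpers.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Hypothetical layout (NOT the actual .rrec format):
///   [payload 0][payload 1]...[(offset, len) per record][index_start][count]
/// All integers are little-endian u64.
fn write_records<W: Write>(out: &mut W, records: &[&[u8]]) -> io::Result<()> {
    let mut index = Vec::with_capacity(records.len());
    let mut offset = 0u64;
    for rec in records {
        out.write_all(rec)?;
        index.push((offset, rec.len() as u64));
        offset += rec.len() as u64;
    }
    // Trailing index, then a fixed-size footer that locates it.
    for (off, len) in &index {
        out.write_all(&off.to_le_bytes())?;
        out.write_all(&len.to_le_bytes())?;
    }
    out.write_all(&offset.to_le_bytes())?; // index_start
    out.write_all(&(index.len() as u64).to_le_bytes())?; // count
    Ok(())
}

fn read_u64<R: Read>(src: &mut R) -> io::Result<u64> {
    let mut buf = [0u8; 8];
    src.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}

/// Fetch record `i` with a constant number of seeks, independent of file size.
fn read_record<R: Read + Seek>(src: &mut R, i: u64) -> io::Result<Vec<u8>> {
    // The footer is the last 16 bytes: index_start, then count.
    src.seek(SeekFrom::End(-16))?;
    let index_start = read_u64(src)?;
    let count = read_u64(src)?;
    if i >= count {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "index out of range"));
    }
    // Each index entry is 16 bytes: offset, then length.
    src.seek(SeekFrom::Start(index_start + 16 * i))?;
    let offset = read_u64(src)?;
    let len = read_u64(src)?;
    src.seek(SeekFrom::Start(offset))?;
    let mut payload = vec![0u8; len as usize];
    src.read_exact(&mut payload)?;
    Ok(payload)
}
```

Putting the index at the end lets the writer stream samples without knowing the record count in advance, which matches the role the document gives to .finish(). The reader works with a std::fs::File or an in-memory std::io::Cursor alike.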

Full Training Loop

The Trainer struct provides a high-level API that wraps forward pass, backward pass, and optimizer step into a single call. Combined with a DataLoader, CosineAnnealingLR scheduler, and clip_grad_norm_, this is a production-ready training loop.

rust
use rumus::optim::{Trainer, AdamW, clip_grad_norm_};
use rumus::optim::lr_scheduler::CosineAnnealingLR;
use rumus::data::DataLoader;
use rumus::nn;

// Model & optimizer
let optimizer = AdamW::new(model.parameters(), 3e-4);
let mut trainer = Trainer::new(optimizer);

// LR scheduler: cosine decay over 100 epochs
let mut scheduler = CosineAnnealingLR::new(3e-4, 100, 1e-6);

// DataLoader: 4 workers, prefetch 2 batches each
let loader = DataLoader::new(dataset, 64, true, true, 4, 2);

for epoch in 0..100 {
    trainer.optimizer_mut().set_lr(scheduler.get_lr());

    for batch in loader.iter() {
        trainer.train_step_with(|grads, params| {
            // Clip gradients before optimizer step
            clip_grad_norm_(grads, params, 1.0);
        }, || {
            let logits = model.forward(&batch.input);
            nn::cross_entropy_loss(&logits, &batch.target)
        });
    }

    // Log before advancing the scheduler so the printed LR
    // is the one this epoch actually used.
    println!(
        "Epoch {:3} | lr {:.6} | loss {:.4}",
        epoch,
        scheduler.get_lr(),
        trainer.epoch_avg_loss(),
    );

    scheduler.step();
}