GPU Acceleration

Transparent GPU acceleration via WebGPU. Move models and tensors to GPU with a single call — all operations automatically dispatch to WGSL compute shaders with CPU fallback.

Overview

RUMUS uses WebGPU (via wgpu) for GPU acceleration. The API is simple: call .to_gpu() on tensors or models, and all subsequent operations automatically run on the GPU. If no GPU is available, operations fall back to CPU without errors.

```rust
use rumus::tensor::Tensor;
use rumus::nn;

// Move a tensor to the GPU
let x = Tensor::randn(&[32, 784]);
let x_gpu = x.to_gpu();

// Move an entire model to the GPU
model.to_gpu();

// All operations automatically dispatch to GPU
let output = model.forward(&x_gpu);
```

Moving to GPU

Tensors provide .to_gpu() to create a GPU-backed copy. Models implement the ModuleToGpu trait, which moves all parameters in-place. Every operation checks is_gpu() at dispatch time to choose the GPU or CPU path.

```rust
// Tensors: .to_gpu() returns a GPU-backed tensor
let x = Tensor::randn(&[64, 3, 28, 28]);
let x_gpu = x.to_gpu();

// Models: .to_gpu() moves all parameters in-place
// via the ModuleToGpu trait
model.to_gpu();

// Operations check is_gpu() at dispatch time:
// → GPU tensor: runs WGSL compute shader
// → CPU tensor: runs CPU fallback
let output = model.forward(&x_gpu);
```
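
The dispatch-time check can be sketched as follows. This is an illustrative stand-in, not RUMUS's actual internals: the `Tensor` struct, `relu_cpu`, and `relu_gpu` here are hypothetical, and the real GPU path launches a WGSL shader rather than computing on the CPU.

```rust
// Illustrative stand-in for a tensor that knows where its data lives.
struct Tensor {
    on_gpu: bool,
    data: Vec<f32>,
}

impl Tensor {
    fn is_gpu(&self) -> bool {
        self.on_gpu
    }
}

// Each op checks is_gpu() once, at dispatch time, and picks a backend.
fn relu(x: &Tensor) -> Tensor {
    if x.is_gpu() {
        relu_gpu(x) // would launch the WGSL compute shader
    } else {
        relu_cpu(x) // plain CPU fallback
    }
}

fn relu_cpu(x: &Tensor) -> Tensor {
    Tensor {
        on_gpu: false,
        data: x.data.iter().map(|v| v.max(0.0)).collect(),
    }
}

fn relu_gpu(x: &Tensor) -> Tensor {
    // Placeholder: the real path would dispatch a shader on the GPU buffer.
    Tensor { on_gpu: true, ..relu_cpu(x) }
}
```

Because the check happens per operation, mixed pipelines work naturally: a CPU tensor and a GPU tensor each take their own path through the same API.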

GPU Architecture

The GPU subsystem is built on three core components that work together to provide efficient, safe GPU computation.

GpuContext

A OnceLock singleton that lazily initializes the GPU device, queue, pipeline cache, and buffer pool on first use. It never panics on missing hardware.

```rust
// GpuContext is a OnceLock singleton — initialized
// lazily on first GPU operation. Never panics on
// missing hardware; falls back to CPU gracefully.
//
// Internally holds:
//   - wgpu::Device
//   - wgpu::Queue
//   - PipelineCache
//   - BufferPool
```
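
A minimal sketch of that lazy-singleton pattern, with `Device` and `Queue` as placeholder types standing in for the wgpu handles. The real initialization requests a wgpu adapter and device; storing an `Option` is what makes "no GPU found" a cached result instead of a panic.

```rust
use std::sync::OnceLock;

// Placeholder types standing in for wgpu::Device / wgpu::Queue.
struct Device;
struct Queue;

struct GpuContext {
    device: Device,
    queue: Queue,
}

// OnceLock runs its init closure exactly once, on first use, even
// across threads. Later calls return the cached result cheaply.
static CONTEXT: OnceLock<Option<GpuContext>> = OnceLock::new();

fn gpu_context() -> Option<&'static GpuContext> {
    CONTEXT
        .get_or_init(|| {
            // Real code would request a wgpu adapter/device here and
            // return None if no suitable hardware exists — callers
            // then take the CPU path instead of panicking.
            Some(GpuContext { device: Device, queue: Queue })
        })
        .as_ref()
}
```

Every GPU operation can then begin with `let Some(ctx) = gpu_context() else { /* CPU fallback */ }` and the missing-hardware case costs one atomic load.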

PipelineCache

Pre-compiles 25 compute pipelines and 9 bind group layouts. Pipeline selection is by enum variant, giving compile-time guarantees that every GPU operation has a valid pipeline.

```rust
// PipelineCache: 25 compute pipelines, 9 bind group
// layouts — all validated at compile time.
//
// Pipelines are created once and cached. Each GPU op
// looks up its pipeline by enum variant, guaranteeing
// no runtime pipeline compilation stalls.
//
// Bind group layouts are shared across compatible
// pipelines, minimizing GPU resource usage.
```
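
One way to get that compile-time guarantee is an exhaustive `match` over a pipeline-kind enum: adding a variant without wiring up a pipeline fails to compile. A sketch with three illustrative variants (the names and `entry_point` strings are hypothetical; the real cache holds 25 pipelines):

```rust
// Illustrative subset of pipeline kinds; the real enum has 25 variants.
#[derive(Clone, Copy)]
enum PipelineKind {
    Add,
    MatMul,
    Relu,
}

// Stand-in for wgpu::ComputePipeline.
struct Pipeline {
    entry_point: &'static str,
}

// All pipelines are built once, up front — lookup never compiles shaders.
struct PipelineCache {
    add: Pipeline,
    matmul: Pipeline,
    relu: Pipeline,
}

impl PipelineCache {
    fn new() -> Self {
        PipelineCache {
            add: Pipeline { entry_point: "add_main" },
            matmul: Pipeline { entry_point: "matmul_main" },
            relu: Pipeline { entry_point: "relu_main" },
        }
    }

    // The exhaustive match is the compile-time guarantee: a new
    // PipelineKind variant without an arm here is a compile error.
    fn get(&self, kind: PipelineKind) -> &Pipeline {
        match kind {
            PipelineKind::Add => &self.add,
            PipelineKind::MatMul => &self.matmul,
            PipelineKind::Relu => &self.relu,
        }
    }
}
```

Lookup by variant is a jump, not a hash or string comparison, so pipeline selection adds essentially nothing to dispatch cost.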

BufferPool

A thread-safe buffer cache using power-of-2 bucketing. Buffers are recycled via the Drop trait, eliminating repeated GPU memory allocation during training loops.

```rust
// BufferPool: thread-safe GPU buffer cache with
// power-of-2 bucketing. When a buffer is dropped,
// it is returned to the pool for reuse.

// Allocation: finds the smallest power-of-2 bucket
// that fits the request, returns a cached buffer
// or allocates a new one.
let buffer = pool.allocate(1024);  // → 1024-byte buffer

// Drop: buffer is returned to its bucket
// automatically via the Drop trait.
drop(buffer);  // → recycled, not deallocated

// This eliminates repeated GPU memory allocation
// during training loops.
```
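
The recycling mechanism can be sketched with a Drop wrapper, using plain `Vec<u8>` buffers as stand-ins for GPU memory (the real pool hands out wgpu buffers; all names here are illustrative):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Plain Vec<u8> stands in for a GPU buffer in this sketch.
struct BufferPool {
    // Free buffers, keyed by power-of-two bucket size.
    free: Mutex<HashMap<usize, Vec<Vec<u8>>>>,
}

// Handing out a wrapper, not the raw buffer, is what makes Drop-based
// recycling work: dropping the wrapper returns the buffer to the pool.
struct PooledBuffer {
    data: Option<Vec<u8>>,
    pool: Arc<BufferPool>,
}

impl BufferPool {
    fn new() -> Arc<Self> {
        Arc::new(BufferPool { free: Mutex::new(HashMap::new()) })
    }

    // Round the request up to the smallest power-of-two bucket, then
    // reuse a cached buffer from that bucket or allocate a fresh one.
    fn allocate(self: &Arc<Self>, size: usize) -> PooledBuffer {
        let bucket = size.next_power_of_two();
        let cached = self.free.lock().unwrap()
            .get_mut(&bucket)
            .and_then(|bufs| bufs.pop());
        PooledBuffer {
            data: Some(cached.unwrap_or_else(|| vec![0u8; bucket])),
            pool: Arc::clone(self),
        }
    }
}

impl Drop for PooledBuffer {
    fn drop(&mut self) {
        if let Some(data) = self.data.take() {
            // data.len() is already a power of two, so it is its bucket.
            self.pool.free.lock().unwrap()
                .entry(data.len())
                .or_default()
                .push(data);
        }
    }
}
```

Power-of-two bucketing trades at most 2x internal fragmentation for a very high hit rate: any request between 513 and 1024 bytes reuses the same bucket, which is exactly the behavior a training loop with fixed-shape tensors wants.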

WGSL Shaders

RUMUS ships 13 WGSL shader modules with 40+ entry points, covering every operation needed for training and inference.

| Category | Operations |
| --- | --- |
| Element-wise | add, sub, mul, div, neg, relu, exp, log, sqrt, tanh |
| Linear Algebra | matmul, transpose, batched matmul |
| Convolution | conv2d forward, conv2d backward (weight & input) |
| Pooling | max_pool2d forward, max_pool2d backward |
| Reduction | sum, mean, max (with argmax) |
| Loss | mse_loss, cross_entropy_loss (forward & backward) |
| Optimizer | sgd_step, adam_step, adamw_step (fused) |
| Utility | fill, copy, broadcast, reshape indexing |

Per-Resource Fences

Instead of global pipeline barriers, RUMUS uses per-resource fences with AtomicUsize sentinels. Each GPU buffer tracks its last submission index. Before reading, only that specific buffer's fence is awaited, so independent operations on different buffers proceed in parallel without stalls.

```rust
// Per-resource fences: each GPU buffer carries an
// AtomicUsize sentinel tracking its last submission.
//
// Before reading a buffer, RUMUS checks its fence:
//   - If the fence matches the current submission
//     index, the buffer is ready.
//   - Otherwise, only that buffer's fence is awaited.
//
// This avoids global pipeline stalls — independent
// operations on different buffers proceed in parallel.
```
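
The bookkeeping behind this can be sketched with one counter for the highest completed submission and one sentinel per buffer. All names here are illustrative, and real code would wait on the GPU queue rather than poll the counter:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Per-device counter: highest submission the GPU has finished.
struct SubmissionTracker {
    completed: AtomicUsize,
}

// Each buffer remembers the index of the last submission that wrote it.
struct GpuBuffer {
    last_write: AtomicUsize,
}

impl SubmissionTracker {
    // Record that submission `idx` writes `buf`.
    fn record_write(&self, buf: &GpuBuffer, idx: usize) {
        buf.last_write.store(idx, Ordering::Release);
    }

    // The GPU signals that all work up to `idx` is done.
    fn signal_completed(&self, idx: usize) {
        self.completed.store(idx, Ordering::Release);
    }

    // Per-resource readiness: only this buffer's own fence matters,
    // so a read never stalls on unrelated in-flight submissions.
    fn is_ready(&self, buf: &GpuBuffer) -> bool {
        buf.last_write.load(Ordering::Acquire)
            <= self.completed.load(Ordering::Acquire)
    }
}
```

Because readiness is a comparison of two atomic loads, checking a fence is nearly free, and a buffer last written in submission 3 can be read even while submission 7 is still running on other buffers.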

Full GPU Training Example

A complete end-to-end example: define a CNN, move it to GPU, train with AdamW, and save the trained weights.

```rust
use rumus::tensor::Tensor;
use rumus::nn::{self, Module, Linear, Conv2d, MaxPool2d, Flatten};
use rumus::optim::{Trainer, AdamW};

#[derive(Module)]
struct ConvNet {
    conv1: Conv2d,
    pool: MaxPool2d,
    flatten: Flatten,
    fc1: Linear,
}

impl ConvNet {
    fn new() -> Self {
        Self {
            conv1: Conv2d::new(1, 32, 3),
            pool: MaxPool2d::new(2, 2),
            flatten: Flatten::new(),
            fc1: Linear::new(32 * 13 * 13, 10),
        }
    }

    fn forward(&self, x: &Tensor) -> Tensor {
        let x = nn::relu(&self.conv1.forward(x));
        let x = self.pool.forward(&x);
        let x = self.flatten.forward(&x);
        self.fc1.forward(&x)
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut model = ConvNet::new();
    model.to_gpu();  // move all parameters to GPU
    model.train();

    let optimizer = AdamW::new(model.parameters(), 0.001);
    let mut trainer = Trainer::new(optimizer);

    for epoch in 0..10 {
        // `train_loader` is assumed to yield (images, labels) batches
        for (images, labels) in &train_loader {
            let images = images.to_gpu();
            let labels = labels.to_gpu();

            trainer.train_step(|| {
                let logits = model.forward(&images);
                nn::cross_entropy_loss(&logits, &labels)
            });
        }

        let avg_loss = trainer.epoch_avg_loss();
        println!("Epoch {}: loss = {:.4}", epoch, avg_loss);
    }

    // Save trained model
    model.eval();
    let state = model.state_dict("");
    nn::save_safetensors(&state, "convnet.safetensors")?;

    Ok(())
}
```