GPU Acceleration
Transparent GPU acceleration via WebGPU. Move models and tensors to GPU with a single call — all operations automatically dispatch to WGSL compute shaders with CPU fallback.
Overview
RUMUS uses WebGPU (via wgpu) for GPU acceleration. The API is simple: call .to_gpu() on tensors or models, and all subsequent operations automatically run on the GPU. If no GPU is available, operations fall back to CPU without errors.
use rumus::tensor::Tensor;
use rumus::nn;

// Move a tensor to the GPU
let x = Tensor::randn(&[32, 784]);
let x_gpu = x.to_gpu();

// Move an entire model to the GPU
model.to_gpu();

// All operations automatically dispatch to GPU
let output = model.forward(&x_gpu);
Moving to GPU
Tensors provide .to_gpu() to create a GPU-backed copy. Models implement the ModuleToGpu trait, which moves all parameters in-place. Every operation checks is_gpu() at dispatch time to choose the GPU or CPU path.
// Tensors: .to_gpu() returns a GPU-backed tensor
let x = Tensor::randn(&[64, 3, 28, 28]);
let x_gpu = x.to_gpu();

// Models: .to_gpu() moves all parameters in-place
// via the ModuleToGpu trait
model.to_gpu();

// Operations check is_gpu() at dispatch time:
// → GPU tensor: runs WGSL compute shader
// → CPU tensor: runs CPU fallback
let output = model.forward(&x_gpu);
GPU Architecture
The GPU subsystem is built on three core components that work together to provide efficient, safe GPU computation.
GpuContext
A OnceLock singleton that lazily initializes the GPU device, queue, pipeline cache, and buffer pool on first use. It never panics on missing hardware.
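The lazy-singleton pattern can be illustrated with `std::sync::OnceLock` from the standard library. This is a hedged sketch, not RUMUS's actual source; the `GpuContext` fields and `gpu_context` helper here are stand-ins (real code would request a wgpu adapter and device inside the init closure):

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for the real GpuContext fields
// (device, queue, pipeline cache, buffer pool).
struct GpuContext {
    device_name: String,
}

static CONTEXT: OnceLock<Option<GpuContext>> = OnceLock::new();

// Returns Some(&GpuContext) if a device could be created, None
// otherwise. The init closure runs at most once, on first call;
// returning None instead of panicking enables the CPU fallback.
fn gpu_context() -> Option<&'static GpuContext> {
    CONTEXT
        .get_or_init(|| {
            // Real code would request a wgpu adapter/device here
            // and return None when no suitable hardware exists.
            Some(GpuContext {
                device_name: "mock-adapter".to_string(),
            })
        })
        .as_ref()
}

fn main() {
    match gpu_context() {
        Some(ctx) => println!("GPU available: {}", ctx.device_name),
        None => println!("no GPU; falling back to CPU"),
    }
}
```

Storing `Option<GpuContext>` rather than `GpuContext` is what makes "never panics on missing hardware" possible: a failed initialization is cached as `None`, so every later call cheaply takes the CPU path.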
// GpuContext is a OnceLock singleton — initialized
// lazily on first GPU operation. Never panics on
// missing hardware; falls back to CPU gracefully.
//
// Internally holds:
//   - wgpu::Device
//   - wgpu::Queue
//   - PipelineCache
//   - BufferPool
PipelineCache
Pre-compiles 25 compute pipelines and 9 bind group layouts. Pipeline selection is by enum variant, giving compile-time guarantees that every GPU operation has a valid pipeline.
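The compile-time guarantee can be sketched with an exhaustive `match` over a pipeline enum. This is a hypothetical miniature with three variants (the real cache covers 25 pipelines); the point is that adding a `PipelineKind` variant without a corresponding cache entry is a compile error, not a runtime miss:

```rust
// Hypothetical subset of RUMUS's pipeline kinds.
enum PipelineKind {
    Add,
    MatMul,
    Relu,
}

// Stand-in for a compiled wgpu::ComputePipeline.
struct Pipeline {
    label: &'static str,
}

struct PipelineCache {
    add: Pipeline,
    matmul: Pipeline,
    relu: Pipeline,
}

impl PipelineCache {
    // Compile every pipeline once, up front.
    fn new() -> Self {
        Self {
            add: Pipeline { label: "add" },
            matmul: Pipeline { label: "matmul" },
            relu: Pipeline { label: "relu" },
        }
    }

    // Exhaustive match: if a new PipelineKind variant is added
    // without a cache field, this function fails to compile.
    fn get(&self, kind: PipelineKind) -> &Pipeline {
        match kind {
            PipelineKind::Add => &self.add,
            PipelineKind::MatMul => &self.matmul,
            PipelineKind::Relu => &self.relu,
        }
    }
}

fn main() {
    let cache = PipelineCache::new();
    println!("{}", cache.get(PipelineKind::MatMul).label);
}
```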
// PipelineCache: 25 compute pipelines, 9 bind group
// layouts — all validated at compile time.
//
// Pipelines are created once and cached. Each GPU op
// looks up its pipeline by enum variant, guaranteeing
// no runtime pipeline compilation stalls.
//
// Bind group layouts are shared across compatible
// pipelines, minimizing GPU resource usage.
BufferPool
A thread-safe buffer cache using power-of-2 bucketing. Buffers are recycled via the Drop trait, eliminating repeated GPU memory allocation during training loops.
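The bucketing strategy can be sketched in plain Rust, independent of wgpu. This is a hypothetical illustration using `Vec<u8>` in place of GPU buffers and an explicit `recycle` method where the real pool hooks into `Drop`:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Round a request up to the next power of two: the bucket key.
fn bucket_size(bytes: usize) -> usize {
    bytes.next_power_of_two()
}

// Hypothetical pool: bucket size -> free buffers of that size.
struct BufferPool {
    free: Mutex<HashMap<usize, Vec<Vec<u8>>>>,
}

impl BufferPool {
    fn new() -> Self {
        Self { free: Mutex::new(HashMap::new()) }
    }

    // Reuse a cached buffer from the matching bucket, or allocate.
    fn allocate(&self, bytes: usize) -> Vec<u8> {
        let size = bucket_size(bytes);
        let mut free = self.free.lock().unwrap();
        free.get_mut(&size)
            .and_then(|bufs| bufs.pop())
            .unwrap_or_else(|| vec![0u8; size])
    }

    // Return a buffer to its bucket instead of freeing it; the
    // real pool does this from a Drop impl on the buffer handle.
    fn recycle(&self, buf: Vec<u8>) {
        let mut free = self.free.lock().unwrap();
        free.entry(buf.len()).or_default().push(buf);
    }
}

fn main() {
    let pool = BufferPool::new();
    let buf = pool.allocate(1000); // rounds up to the 1024 bucket
    assert_eq!(buf.len(), 1024);
    pool.recycle(buf);
    let again = pool.allocate(900); // same bucket: reuses the buffer
    assert_eq!(again.len(), 1024);
}
```

Power-of-2 bucketing trades at most 2x internal fragmentation for a high cache-hit rate: a training loop that allocates slightly different sizes each step still lands in the same small set of buckets.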
// BufferPool: thread-safe GPU buffer cache with
// power-of-2 bucketing. When a buffer is dropped,
// it is returned to the pool for reuse.

// Allocation: finds the smallest power-of-2 bucket
// that fits the request, returns a cached buffer
// or allocates a new one.
let buffer = pool.allocate(1024); // → 1024-byte buffer

// Drop: buffer is returned to its bucket
// automatically via the Drop trait.
drop(buffer); // → recycled, not deallocated

// This eliminates repeated GPU memory allocation
// during training loops.
WGSL Shaders
RUMUS ships 13 WGSL shader modules with 40+ entry points, covering every operation needed for training and inference.
| Category | Operations |
|---|---|
| Element-wise | add, sub, mul, div, neg, relu, exp, log, sqrt, tanh |
| Linear Algebra | matmul, transpose, batched matmul |
| Convolution | conv2d forward, conv2d backward (weight & input) |
| Pooling | max_pool2d forward, max_pool2d backward |
| Reduction | sum, mean, max (with argmax) |
| Loss | mse_loss, cross_entropy_loss (forward & backward) |
| Optimizer | sgd_step, adam_step, adamw_step (fused) |
| Utility | fill, copy, broadcast, reshape indexing |
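For a flavor of what these entry points look like, here is a minimal element-wise add kernel in WGSL. This is a sketch of the general pattern (storage buffers bound by index, one invocation per element, a bounds check against the output length), not RUMUS's actual shader source:

```wgsl
@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> out: array<f32>;

// One invocation per element; 64 threads per workgroup.
@compute @workgroup_size(64)
fn add(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    // Guard against the final partial workgroup.
    if (i < arrayLength(&out)) {
        out[i] = a[i] + b[i];
    }
}
```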
Per-Resource Fences
Instead of global pipeline barriers, RUMUS uses per-resource fences with AtomicUsize sentinels. Each GPU buffer tracks its last submission index. Before reading, only that specific buffer's fence is awaited — independent operations on different buffers proceed in parallel without stalls.
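The sentinel mechanism can be sketched with `std::sync::atomic::AtomicUsize`. This is a hypothetical reduction of the idea; the `Fence` type, `signal`, and `is_ready` names are illustrative, and the real implementation awaits the GPU queue rather than polling a counter:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical per-buffer fence: records the index of the last
// GPU submission that wrote this buffer.
struct Fence {
    last_submission: AtomicUsize,
}

impl Fence {
    fn new() -> Self {
        Self { last_submission: AtomicUsize::new(0) }
    }

    // Called when work writing the buffer is submitted.
    fn signal(&self, submission_index: usize) {
        self.last_submission.store(submission_index, Ordering::Release);
    }

    // Called before reading: has every submission up to and
    // including this buffer's last writer completed?
    fn is_ready(&self, completed: usize) -> bool {
        self.last_submission.load(Ordering::Acquire) <= completed
    }
}

fn main() {
    let fence = Fence::new();
    fence.signal(3); // buffer last written by submission #3
    assert!(!fence.is_ready(2)); // submission 3 not yet complete
    assert!(fence.is_ready(3)); // now safe to read this buffer
}
```

Because each buffer carries its own counter, a reader only waits on the submissions that touched *its* buffer; work queued against other buffers never forces a wait.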
// Per-resource fences: each GPU buffer carries an
// AtomicUsize sentinel tracking its last submission.
//
// Before reading a buffer, RUMUS checks its fence:
//   - If the fence matches the current submission
//     index, the buffer is ready.
//   - Otherwise, only that buffer's fence is awaited.
//
// This avoids global pipeline stalls — independent
// operations on different buffers proceed in parallel.
Full GPU Training Example
A complete end-to-end example: define a CNN, move it to GPU, train with AdamW, and save the trained weights.
use rumus::tensor::Tensor;
use rumus::nn::{self, Module, Linear, Conv2d, MaxPool2d, Flatten};
use rumus::optim::{Trainer, AdamW};

#[derive(Module)]
struct ConvNet {
    conv1: Conv2d,
    pool: MaxPool2d,
    flatten: Flatten,
    fc1: Linear,
}

impl ConvNet {
    fn new() -> Self {
        Self {
            conv1: Conv2d::new(1, 32, 3),
            pool: MaxPool2d::new(2, 2),
            flatten: Flatten::new(),
            fc1: Linear::new(32 * 13 * 13, 10),
        }
    }

    fn forward(&self, x: &Tensor) -> Tensor {
        let x = nn::relu(&self.conv1.forward(x));
        let x = self.pool.forward(&x);
        let x = self.flatten.forward(&x);
        self.fc1.forward(&x)
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut model = ConvNet::new();
    model.to_gpu(); // move all parameters to GPU
    model.train();

    let optimizer = AdamW::new(model.parameters(), 0.001);
    let mut trainer = Trainer::new(optimizer);

    for epoch in 0..10 {
        // train_loader is assumed to yield (images, labels) batches
        for (images, labels) in &train_loader {
            let images = images.to_gpu();
            let labels = labels.to_gpu();
            trainer.train_step(|| {
                let logits = model.forward(&images);
                nn::cross_entropy_loss(&logits, &labels)
            });
        }
        let avg_loss = trainer.epoch_avg_loss();
        println!("Epoch {}: loss = {:.4}", epoch, avg_loss);
    }

    // Save trained model
    model.eval();
    let state = model.state_dict("");
    nn::save_safetensors(&state, "convnet.safetensors")?;
    Ok(())
}