Built with Rust

Deep learning with memory safety

RUMUS is a native-Rust deep learning framework that satisfies the borrow checker while delivering PyTorch-like ergonomics. From CNNs to Transformers, it combines zero-cost abstractions, GPU acceleration, and compile-time safety guarantees.


$ cargo add rumus

Why RUMUS?

Most deep learning frameworks achieve safety through runtime reference counting. RUMUS enforces memory safety as a first-class language constraint — checked at compile time, not runtime.

Memory Safe by Design

Rust's borrow checker enforces memory safety at compile time. No runtime reference counting, no data races, no use-after-free — guaranteed by the type system.

Zero-Cost Abstractions

View operations like reshape and transpose are metadata-only — zero memory allocation. Inference mode completely bypasses the autograd tape with no overhead.
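How an inference mode can skip the tape entirely is easiest to see with a guard object. The sketch below assumes a thread-local "grad enabled" flag and a toy tape of op names; `no_grad`, `NoGradGuard`, `mul` and `tape_len` are hypothetical names, not RUMUS's API.

```rust
use std::cell::{Cell, RefCell};

thread_local! {
    // Hypothetical flag: record ops on the tape only while this is true.
    static GRAD_ENABLED: Cell<bool> = Cell::new(true);
    // Toy tape: just the names of recorded ops.
    static TAPE: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

/// RAII guard: disables recording until dropped, then restores the previous state.
struct NoGradGuard {
    prev: bool,
}

fn no_grad() -> NoGradGuard {
    let prev = GRAD_ENABLED.with(|g| g.replace(false));
    NoGradGuard { prev }
}

impl Drop for NoGradGuard {
    fn drop(&mut self) {
        GRAD_ENABLED.with(|g| g.set(self.prev));
    }
}

/// Toy op: always computes the result, but touches the tape only when grad is enabled.
fn mul(a: f32, b: f32) -> f32 {
    if GRAD_ENABLED.with(|g| g.get()) {
        TAPE.with(|t| t.borrow_mut().push("mul"));
    }
    a * b
}

fn tape_len() -> usize {
    TAPE.with(|t| t.borrow().len())
}

fn main() {
    mul(2.0, 3.0); // recorded
    {
        let _guard = no_grad();
        mul(2.0, 3.0); // not recorded: the guard is alive
    }
    mul(2.0, 3.0); // recorded again after the guard drops
    println!("tape length: {}", tape_len()); // prints "tape length: 2"
}
```

Binding the guard to `_guard` (not `_`) matters: `let _ = no_grad();` would drop it immediately and re-enable recording.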

GPU Acceleration

A wgpu compute backend with 30+ WGSL shader modules (including FlashAttention), JIT kernel fusion, and multi-GPU DataParallel/FSDP. Per-resource fences and buffer pooling keep training steps free of new allocations.
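Buffer pooling is what makes a steady-state training step allocation-free: buffers released at the end of one step are handed back out in the next. A minimal sketch, assuming a pool keyed by exact element count; `BufferPool`, `acquire` and `release` are hypothetical, and `Vec<f32>` stands in for a GPU buffer.

```rust
use std::collections::HashMap;

/// Hypothetical size-bucketed pool; Vec<f32> stands in for a GPU buffer.
struct BufferPool {
    free: HashMap<usize, Vec<Vec<f32>>>,
    allocations: usize, // number of fresh allocations performed so far
}

impl BufferPool {
    fn new() -> Self {
        Self { free: HashMap::new(), allocations: 0 }
    }

    /// Returns a zeroed buffer of `len` elements, reusing a released one when possible.
    fn acquire(&mut self, len: usize) -> Vec<f32> {
        if let Some(mut buf) = self.free.get_mut(&len).and_then(|bucket| bucket.pop()) {
            buf.fill(0.0); // clear stale contents from the previous user
            buf
        } else {
            self.allocations += 1;
            vec![0.0; len]
        }
    }

    /// Hands a buffer back to its size bucket for later reuse.
    fn release(&mut self, buf: Vec<f32>) {
        self.free.entry(buf.len()).or_default().push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new();
    for step in 0..3 {
        // Each simulated step needs two activation buffers of fixed sizes.
        let a = pool.acquire(1024);
        let b = pool.acquire(256);
        pool.release(a);
        pool.release(b);
        println!("step {step}: total allocations = {}", pool.allocations);
    }
}
```

Only the first step allocates; later steps are served entirely from the pool.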

PyTorch-Like Ergonomics

Familiar eager execution model. Define-by-run autograd. Module system with #[derive(Module)] proc macro. If you know PyTorch, you already know RUMUS.
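Parameter collection through a derive macro amounts to visiting every field and concatenating its parameters. The sketch below writes out by hand what such a derive could generate; the `Module` trait, its `parameters(prefix)` signature and this toy `Linear` are assumptions for illustration, not RUMUS's actual types or macro output.

```rust
/// Hypothetical trait: returns (qualified name, parameter data) pairs.
trait Module {
    fn parameters(&self, prefix: &str) -> Vec<(String, &[f32])>;
}

struct Linear {
    weight: Vec<f32>,
    bias: Vec<f32>,
}

impl Linear {
    fn new(in_features: usize, out_features: usize) -> Self {
        Self {
            weight: vec![0.0; in_features * out_features],
            bias: vec![0.0; out_features],
        }
    }
}

impl Module for Linear {
    fn parameters(&self, prefix: &str) -> Vec<(String, &[f32])> {
        vec![
            (format!("{prefix}weight"), self.weight.as_slice()),
            (format!("{prefix}bias"), self.bias.as_slice()),
        ]
    }
}

struct Net {
    fc1: Linear,
    fc2: Linear,
}

// Hand-written equivalent of a derive: visit fields in declaration order,
// extending the name prefix with each field name.
impl Module for Net {
    fn parameters(&self, prefix: &str) -> Vec<(String, &[f32])> {
        let mut out = self.fc1.parameters(&format!("{prefix}fc1."));
        out.extend(self.fc2.parameters(&format!("{prefix}fc2.")));
        out
    }
}

fn main() {
    let net = Net { fc1: Linear::new(784, 128), fc2: Linear::new(128, 10) };
    for (name, data) in net.parameters("") {
        println!("{name}: {} values", data.len());
    }
}
```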

Transformer Ready

Multi-head attention with FlashAttention, LayerNorm, embeddings, and causal masking. Train GPT-style models end-to-end on WebGPU, then serve them with rumus-serve, a continuous-batching inference server.
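Causal masking means position i may only attend to positions j ≤ i. The sketch below is a naive single-head reference on row-major `[seq_len, d]` slices, assuming scaled dot-product attention; `causal_attention` is a hypothetical name, and this dense loop is not FlashAttention (which computes the same result tile by tile).

```rust
/// Naive single-head causal attention over row-major [seq_len, d] matrices.
fn causal_attention(q: &[f32], k: &[f32], v: &[f32], seq_len: usize, d: usize) -> Vec<f32> {
    let scale = 1.0 / (d as f32).sqrt();
    let mut out = vec![0.0f32; seq_len * d];
    for i in 0..seq_len {
        // Scores only for the unmasked prefix j in 0..=i; masked entries
        // would be -inf and contribute exp(-inf) = 0, so they are skipped.
        let scores: Vec<f32> = (0..=i)
            .map(|j| (0..d).map(|c| q[i * d + c] * k[j * d + c]).sum::<f32>() * scale)
            .collect();
        // Numerically stable softmax: subtract the row max before exponentiating.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let denom: f32 = exps.iter().sum();
        for (j, e) in exps.iter().enumerate() {
            let w = e / denom;
            for c in 0..d {
                out[i * d + c] += w * v[j * d + c];
            }
        }
    }
    out
}

fn main() {
    let q: Vec<f32> = vec![1.0, 0.0, 0.0, 1.0, 1.0, 1.0];
    let k = q.clone();
    let v: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    println!("{:?}", causal_attention(&q, &k, &v, 3, 2));
}
```

The first output row always equals the first value row, since position 0 can only see itself.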

Production Ready

FP16/INT8/INT4 quantization, ONNX export, JIT fusion, 3D parallelism (DP + FSDP + TP + PP), inference server, graph engine, and direct-convolution vision engine.
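As a concrete instance of the INT8 case, symmetric per-tensor quantization stores `q = round(x / scale)` with `scale = max|x| / 127`. This is a minimal sketch of that common scheme, assumed here for illustration; it does not show RUMUS's actual quantization format, and INT4 packing is not covered.

```rust
/// Symmetric per-tensor INT8 quantization: scale = max|x| / 127, q = round(x / scale).
fn quantize_int8(x: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Avoid dividing by zero for an all-zero tensor.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q: Vec<i8> = x
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Recovers approximate f32 values; per-element error is at most scale / 2.
fn dequantize_int8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let x: Vec<f32> = vec![0.0, 1.0, -2.0, 0.5];
    let (q, scale) = quantize_int8(&x);
    println!("q = {q:?}, scale = {scale}");
    println!("dequantized = {:?}", dequantize_int8(&q, scale));
}
```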

Three orthogonal layers

The Tensor is not a junk drawer. RUMUS strictly partitions internal state into three independent layers, each with a single responsibility. This makes the framework auditable, predictable, and easy to extend.

Storage Layer

Raw memory management with CPU/GPU unified addressing, version tracking, and per-resource fences
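Version tracking lets the autograd layer detect a tensor that was modified in place after an op saved it for the backward pass. A minimal single-threaded sketch, assuming a counter bumped on every in-place write; `Storage`, `SavedTensor` and `add_scalar_` are hypothetical, and GPU fences are not modeled.

```rust
use std::cell::{Cell, RefCell};
use std::rc::Rc;

/// Hypothetical CPU storage with a version counter bumped on every in-place write.
struct Storage {
    data: RefCell<Vec<f32>>,
    version: Cell<u64>,
}

impl Storage {
    fn new(data: Vec<f32>) -> Rc<Self> {
        Rc::new(Self { data: RefCell::new(data), version: Cell::new(0) })
    }

    /// In-place update: mutate the data, then bump the version.
    fn add_scalar_(&self, s: f32) {
        self.data.borrow_mut().iter_mut().for_each(|x| *x += s);
        self.version.set(self.version.get() + 1);
    }
}

/// What an op keeps for backward: the storage plus the version it observed.
struct SavedTensor {
    storage: Rc<Storage>,
    saved_version: u64,
}

impl SavedTensor {
    fn save(storage: &Rc<Storage>) -> Self {
        Self { storage: Rc::clone(storage), saved_version: storage.version.get() }
    }

    /// Fails if the storage was written in place after it was saved.
    fn unpack(&self) -> Result<Vec<f32>, String> {
        let now = self.storage.version.get();
        if now != self.saved_version {
            return Err(format!("saved at version {}, now at version {now}", self.saved_version));
        }
        Ok(self.storage.data.borrow().clone())
    }
}

fn main() {
    let s = Storage::new(vec![1.0, 2.0]);
    let saved = SavedTensor::save(&s);
    println!("{:?}", saved.unpack()); // Ok([1.0, 2.0])
    s.add_scalar_(1.0);
    println!("{:?}", saved.unpack()); // Err(...): stale saved tensor
}
```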

Layout Layer

Shape, strides, and view semantics — reshape and transpose are zero-allocation metadata operations
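Why transpose and reshape cost no allocation follows from the stride arithmetic: element `(i0, i1, ...)` lives at `offset + Σ i_k · strides[k]`. A minimal sketch, assuming row-major strides; `Layout` and its methods are hypothetical, and this reshape only handles contiguous views (a non-contiguous view would need a copy).

```rust
/// Hypothetical view metadata over a shared flat storage.
#[derive(Debug, Clone, PartialEq)]
struct Layout {
    shape: Vec<usize>,
    strides: Vec<usize>,
    offset: usize,
}

impl Layout {
    /// Row-major (C-contiguous) layout for `shape`.
    fn contiguous(shape: &[usize]) -> Self {
        let mut strides = vec![1; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1];
        }
        Self { shape: shape.to_vec(), strides, offset: 0 }
    }

    fn is_contiguous(&self) -> bool {
        self.strides == Self::contiguous(&self.shape).strides
    }

    /// Swaps two dims by swapping shape and stride entries; no data moves.
    fn transpose(&self, a: usize, b: usize) -> Self {
        let mut out = self.clone();
        out.shape.swap(a, b);
        out.strides.swap(a, b);
        out
    }

    /// Metadata-only reshape; returns None if sizes differ or the view is not contiguous.
    fn reshape(&self, new_shape: &[usize]) -> Option<Self> {
        let old: usize = self.shape.iter().product();
        let new: usize = new_shape.iter().product();
        if old != new || !self.is_contiguous() {
            return None;
        }
        Some(Self { offset: self.offset, ..Self::contiguous(new_shape) })
    }

    /// Maps a multi-index to a flat position in the shared storage.
    fn index(&self, idx: &[usize]) -> usize {
        self.offset + idx.iter().zip(&self.strides).map(|(i, s)| i * s).sum::<usize>()
    }
}

fn main() {
    let data: Vec<f32> = (0..6).map(|x| x as f32).collect(); // one shared buffer
    let a = Layout::contiguous(&[2, 3]);
    let t = a.transpose(0, 1);
    println!("a = {a:?}\nt = {t:?}");
    println!("a[1][2] = {}, t[2][1] = {}", data[a.index(&[1, 2])], data[t.index(&[2, 1])]);
}
```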

Autograd Layer

Gradient tracking with append-only Wengert tape, Kahn's algorithm backward pass, and concrete BackwardOp enum
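A scalar toy version of this design: an append-only tape of values plus a concrete `BackwardOp` enum, and a backward pass that uses Kahn's algorithm so a node is processed only after every node consuming it has pushed its gradient. The enum variants, `Tape` and its methods are hypothetical stand-ins, not RUMUS's definitions.

```rust
use std::collections::VecDeque;

/// Hypothetical concrete backward ops; each names its input node ids on the tape.
#[derive(Clone, Copy)]
enum BackwardOp {
    Leaf,
    Add(usize, usize),
    Mul(usize, usize),
}

/// Append-only tape: node ids are indices into `values` / `ops`.
struct Tape {
    values: Vec<f32>,
    ops: Vec<BackwardOp>,
}

impl Tape {
    fn new() -> Self {
        Self { values: Vec::new(), ops: Vec::new() }
    }
    fn push(&mut self, value: f32, op: BackwardOp) -> usize {
        self.values.push(value);
        self.ops.push(op);
        self.values.len() - 1
    }
    fn leaf(&mut self, v: f32) -> usize {
        self.push(v, BackwardOp::Leaf)
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        let v = self.values[a] + self.values[b];
        self.push(v, BackwardOp::Add(a, b))
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        let v = self.values[a] * self.values[b];
        self.push(v, BackwardOp::Mul(a, b))
    }
    fn inputs(op: BackwardOp) -> Vec<usize> {
        match op {
            BackwardOp::Leaf => vec![],
            BackwardOp::Add(a, b) | BackwardOp::Mul(a, b) => vec![a, b],
        }
    }

    /// Reverse-mode pass ordered by Kahn's algorithm on the reversed graph.
    fn backward(&self, root: usize) -> Vec<f32> {
        let n = self.values.len();
        // 1. Mark nodes reachable from the root.
        let mut reachable = vec![false; n];
        let mut stack = vec![root];
        while let Some(id) = stack.pop() {
            if !reachable[id] {
                reachable[id] = true;
                stack.extend(Self::inputs(self.ops[id]));
            }
        }
        // 2. In-degree = number of reachable consumers of each node.
        let mut pending = vec![0usize; n];
        for id in (0..n).filter(|&i| reachable[i]) {
            for inp in Self::inputs(self.ops[id]) {
                pending[inp] += 1;
            }
        }
        // 3. Start from the root; release a node once all its consumers are done.
        let mut grads = vec![0.0f32; n];
        grads[root] = 1.0;
        let mut queue = VecDeque::from([root]);
        while let Some(id) = queue.pop_front() {
            let g = grads[id];
            let contributions: Vec<(usize, f32)> = match self.ops[id] {
                BackwardOp::Leaf => vec![],
                BackwardOp::Add(a, b) => vec![(a, g), (b, g)],
                BackwardOp::Mul(a, b) => vec![(a, g * self.values[b]), (b, g * self.values[a])],
            };
            for (inp, c) in contributions {
                grads[inp] += c;
                pending[inp] -= 1;
                if pending[inp] == 0 {
                    queue.push_back(inp);
                }
            }
        }
        grads
    }
}

fn main() {
    let mut t = Tape::new();
    let x = t.leaf(2.0);
    let y = t.leaf(3.0);
    let xy = t.mul(x, y);
    let z = t.add(xy, x); // z = x*y + x
    let g = t.backward(z);
    println!("dz/dx = {}, dz/dy = {}", g[x], g[y]); // dz/dx = 4, dz/dy = 2
}
```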

// Tensor anatomy

Tensor
  StorageHandle: Cpu(Vec<f32>) | Gpu(wgpu::Buffer) | Both {...}
  Layout: shape, strides, offset (views share storage)
  AutogradState: None | Tracked(grad_id, creator_op, is_leaf)

Familiar and expressive

If you know PyTorch, you already know RUMUS. Define models with structs, derive the Module trait, and train with eager execution.

examples/train.rs

Linear">
use rumus::nn::{self, Linear, Module};
use rumus::optim::Adam;
use rumus::autograd;
use rumus::Tensor;

#[derive(Module)]
struct Net {
    fc1: Linear,
    fc2: Linear,
}

impl Net {
    fn new() -> Self {
        Self {
            fc1: Linear::new(784, 128), // 784 input features -> 128 hidden units
            fc2: Linear::new(128, 10),  // 128 hidden units -> 10 class logits
        }
    }

    fn forward(&self, x: &Tensor) -> Tensor {
        let h = nn::relu(&self.fc1.forward(x));
        self.fc2.forward(&h)
    }
}

// `inputs` and `targets` come from your data pipeline (for example a
// DataLoader batch); loading them is not shown here.
fn train(inputs: &Tensor, targets: &Tensor) -> Result<(), Box<dyn std::error::Error>> {
    let model = Net::new();
    let mut opt = Adam::new(model.parameters(), 0.001);

    for epoch in 0..100 {
        // Eager forward pass; ops are recorded on the autograd tape.
        let pred = model.forward(inputs);
        let loss = nn::cross_entropy_loss(&pred, targets);

        // Walk the tape backward, then apply the Adam update.
        let mut grads = autograd::backward(&loss)?;
        opt.step(&mut grads)?;

        println!("Epoch {epoch}: loss = {:.4}", loss.item());
    }

    nn::save_safetensors(&model.state_dict(""), "model.safetensors")?;
    Ok(())
}

Complete training ecosystem

Everything you need to define, train, and deploy deep learning models — from autograd to GPU-fused optimizers.

Layers & Modules

Linear, Conv2d, ConvTranspose2d, MaxPool2d, AdaptiveAvgPool2d, BatchNorm2d, LayerNorm, Dropout, Embedding — with automatic parameter collection via #[derive(Module)].

Linear · Conv2d · ConvTranspose2d · BatchNorm2d · LayerNorm · Embedding · Dropout
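What a Conv2d layer computes can be pinned down with the plainest possible loop nest. This is a minimal direct-convolution sketch assuming one input and one output channel, stride 1 and no padding, written as cross-correlation as most deep learning frameworks define it; `conv2d_direct` is a hypothetical name and says nothing about how RUMUS's GPU kernels are organized.

```rust
/// Naive direct 2-D convolution (cross-correlation), single channel, stride 1, no padding.
/// `input` is [h, w] row-major, `kernel` is [kh, kw]; output is [h - kh + 1, w - kw + 1].
fn conv2d_direct(input: &[f32], h: usize, w: usize, kernel: &[f32], kh: usize, kw: usize) -> Vec<f32> {
    assert!(kh <= h && kw <= w, "kernel larger than input");
    let (oh, ow) = (h - kh + 1, w - kw + 1);
    let mut out = vec![0.0f32; oh * ow];
    for y in 0..oh {
        for x in 0..ow {
            // Dot product of the kernel with the input window anchored at (y, x).
            let mut acc = 0.0;
            for ky in 0..kh {
                for kx in 0..kw {
                    acc += input[(y + ky) * w + (x + kx)] * kernel[ky * kw + kx];
                }
            }
            out[y * ow + x] = acc;
        }
    }
    out
}

fn main() {
    let input: Vec<f32> = (1..=9).map(|v| v as f32).collect(); // 3x3: 1..9
    let kernel: Vec<f32> = vec![1.0, 0.0, 0.0, 1.0];           // 2x2 diagonal
    println!("{:?}", conv2d_direct(&input, 3, 3, &kernel, 2, 2)); // [6.0, 8.0, 12.0, 14.0]
}
```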

Optimizers

SGD, Adam, and AdamW — all GPU-fused. LR schedulers (StepLR, CosineAnnealing), gradient clipping, and a multithreaded DataLoader with prefetching.

SGD · Adam · AdamW · StepLR · CosineAnnealing · clip_grad_norm_ · DataLoader
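For reference, global-norm gradient clipping and the standard Adam update with bias correction look like this on a flat parameter vector. It is a plain CPU sketch of the textbook formulas, not the GPU-fused kernels; `clip_grad_norm`, this `Adam` struct and its `step(params, grads)` signature are hypothetical and differ from the `opt.step(&mut grads)` call in the training example.

```rust
/// Scales `grads` in place so their global L2 norm is at most `max_norm`;
/// returns the norm measured before clipping.
fn clip_grad_norm(grads: &mut [f32], max_norm: f32) -> f32 {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        grads.iter_mut().for_each(|g| *g *= scale);
    }
    norm
}

/// Minimal Adam state for one flat parameter vector.
struct Adam {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    m: Vec<f32>, // first-moment (mean) estimate
    v: Vec<f32>, // second-moment (uncentered variance) estimate
    t: i32,      // step counter used for bias correction
}

impl Adam {
    fn new(n: usize, lr: f32) -> Self {
        Self { lr, beta1: 0.9, beta2: 0.999, eps: 1e-8, m: vec![0.0; n], v: vec![0.0; n], t: 0 }
    }

    fn step(&mut self, params: &mut [f32], grads: &[f32]) {
        self.t += 1;
        let bc1 = 1.0 - self.beta1.powi(self.t);
        let bc2 = 1.0 - self.beta2.powi(self.t);
        for i in 0..params.len() {
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * grads[i];
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * grads[i] * grads[i];
            let m_hat = self.m[i] / bc1;
            let v_hat = self.v[i] / bc2;
            params[i] -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }
}

fn main() {
    let mut params: Vec<f32> = vec![1.0, -1.0];
    let mut grads: Vec<f32> = vec![3.0, 4.0];
    let norm = clip_grad_norm(&mut grads, 1.0);
    let mut opt = Adam::new(params.len(), 0.1);
    opt.step(&mut params, &grads);
    println!("pre-clip norm = {norm}, params = {params:?}");
}
```

On the first step the bias-corrected update is about `lr` per element in the direction opposite the gradient's sign, regardless of the gradient's magnitude.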

Activations & Loss

ReLU, GELU, Sigmoid, Tanh, LeakyReLU, and Softmax activations. MSE and cross-entropy losses, with cross-entropy computed via log-sum-exp for numerical stability.

ReLU · GELU · Sigmoid · Tanh · Softmax · MSE · Cross-Entropy
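The log-sum-exp trick works because `-log softmax(z)[t] = LSE(z) - z[t]` and `LSE(z) = max(z) + log Σ exp(z_j - max(z))`, so no exponent ever exceeds 0. A minimal sketch of that identity; the function names and the mean-over-batch reduction are assumptions, not RUMUS's `cross_entropy_loss` implementation.

```rust
/// log(sum(exp(x))), computed stably by factoring out the maximum.
fn log_sum_exp(x: &[f32]) -> f32 {
    let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    max + x.iter().map(|v| (v - max).exp()).sum::<f32>().ln()
}

/// Cross-entropy for one sample: -log softmax(logits)[target] = LSE(logits) - logits[target].
fn cross_entropy(logits: &[f32], target: usize) -> f32 {
    log_sum_exp(logits) - logits[target]
}

/// Mean cross-entropy over a row-major [batch, classes] logit matrix.
fn cross_entropy_batch(logits: &[f32], classes: usize, targets: &[usize]) -> f32 {
    let total: f32 = targets
        .iter()
        .enumerate()
        .map(|(i, &t)| cross_entropy(&logits[i * classes..(i + 1) * classes], t))
        .sum();
    total / targets.len() as f32
}

fn main() {
    let logits: Vec<f32> = vec![2.0, 0.5, -1.0, 1000.0, 0.0, 0.0];
    println!("loss = {}", cross_entropy_batch(&logits, 3, &[0, 0]));
}
```

A naive `exp(1000.0)` overflows `f32` to infinity; the shifted form above stays finite.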

Transformer & Attention

FlashAttention, JIT kernel fusion, 3D parallelism (DataParallel, FSDP, Tensor Parallel, Pipeline Parallel), INT4 quantized inference, and continuous batching via rumus-serve.

FlashAttention · JIT Fusion · DataParallel · FSDP · TensorParallel · PipelineParallel · INT4 · rumus-serve
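The core trick FlashAttention builds on is the online-softmax recurrence: stream over keys in blocks, keep a running max `m`, running denominator `l` and running weighted sum, and rescale both by `exp(m_old - m_new)` whenever the max grows. The sketch below shows it for a single query row with precomputed, scaled scores and assumes finite scores and `block > 0`; the function names are hypothetical, and tiling, shared memory and the backward pass are omitted.

```rust
/// Reference: softmax(scores)-weighted sum of value rows ([n, d] row-major).
fn softmax_attend(scores: &[f32], values: &[f32], d: usize) -> Vec<f32> {
    let m = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let w: Vec<f32> = scores.iter().map(|s| (s - m).exp()).collect();
    let denom: f32 = w.iter().sum();
    let mut out = vec![0.0f32; d];
    for (j, wj) in w.iter().enumerate() {
        for c in 0..d {
            out[c] += wj / denom * values[j * d + c];
        }
    }
    out
}

/// Same result, computed block by block without materializing all weights.
fn online_softmax_attend(scores: &[f32], values: &[f32], d: usize, block: usize) -> Vec<f32> {
    let mut m = f32::NEG_INFINITY; // running max of scores seen so far
    let mut l = 0.0f32;            // running sum of exp(score - m)
    let mut acc = vec![0.0f32; d]; // running sum of exp(score - m) * v_j
    for (b, chunk) in scores.chunks(block).enumerate() {
        let m_new = chunk.iter().cloned().fold(m, f32::max);
        // Rescale earlier partial results to the new max (exp(-inf) = 0 on the first block).
        let correction = (m - m_new).exp();
        l *= correction;
        acc.iter_mut().for_each(|a| *a *= correction);
        for (i, &s) in chunk.iter().enumerate() {
            let j = b * block + i;
            let p = (s - m_new).exp();
            l += p;
            for c in 0..d {
                acc[c] += p * values[j * d + c];
            }
        }
        m = m_new;
    }
    acc.iter().map(|a| a / l).collect()
}

fn main() {
    let scores: Vec<f32> = vec![0.5, 2.0, -1.0, 3.0, 0.1];
    let values: Vec<f32> = vec![1.0, 0.0, 0.0, 1.0, 2.0, 2.0, -1.0, 3.0, 0.5, 0.5];
    println!("reference = {:?}", softmax_attend(&scores, &values, 2));
    println!("online    = {:?}", online_softmax_attend(&scores, &values, 2, 2));
}
```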

Ready to get started?

Add RUMUS to your Rust project and start building memory-safe deep learning models today.
