Tensors
Tensors are the foundational data structure in RUMUS. They hold multi-dimensional arrays of f32 values with automatic differentiation support built in.
Creating Tensors
Use Tensor::new to create a tensor from a flat data vector and a shape. The data is stored in row-major order.
use rumus::tensor::Tensor;
// Create a 2x3 tensor from a flat vector
let t = Tensor::new(
    vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    vec![2, 3],
);
// Access the shape
assert_eq!(t.shape(), &[2, 3]);
Shapes and Views
RUMUS distinguishes between view operations and operations that allocate new storage. View ops like reshape and transpose are zero-copy — they return a new tensor that shares the same underlying memory but with a different layout descriptor.
Layout descriptor: Each tensor has a shape, strides, and offset. Views modify these fields without touching the data buffer, making them O(1) regardless of tensor size.
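To make the layout descriptor concrete, the sketch below computes row-major strides and treats transpose purely as a permutation of shape and strides. The `Layout` struct and its methods are hypothetical stand-ins written for illustration; they are not RUMUS's actual types.

```rust
/// Hypothetical layout descriptor; field names mirror the text above.
#[derive(Debug, Clone, PartialEq)]
struct Layout {
    shape: Vec<usize>,
    strides: Vec<usize>,
    offset: usize,
}

impl Layout {
    /// Row-major (C-order) layout: the last axis has stride 1.
    fn contiguous(shape: Vec<usize>) -> Self {
        let mut strides = vec![1; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1];
        }
        Layout { shape, strides, offset: 0 }
    }

    /// Permute axes by reordering shape and strides; no data is touched.
    fn transpose(&self, perm: &[usize]) -> Self {
        Layout {
            shape: perm.iter().map(|&p| self.shape[p]).collect(),
            strides: perm.iter().map(|&p| self.strides[p]).collect(),
            offset: self.offset,
        }
    }

    /// Position in the flat buffer of a multi-dimensional index.
    fn flat_index(&self, index: &[usize]) -> usize {
        self.offset + index.iter().zip(&self.strides).map(|(i, s)| i * s).sum::<usize>()
    }
}
```

With shape [2, 3] the contiguous strides are [3, 1]. After transposing with [1, 0] they become [1, 3], so index [2, 1] in the view resolves to the same buffer slot as [1, 2] in the original.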
"token-comment">// Reshape is a zero-copy view operation.
"token-comment">// The underlying storage is shared — only the
"token-comment">// layout(shape, strides, offset) changes.
let t = Tensor::new(
    vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    vec![2, 3],
);
let reshaped = t.reshape(vec![3, 2]);
assert_eq!(reshaped.shape(), &[3, 2]);
// Transpose also creates a view; no data is copied.
// The permutation vector reorders the axes.
let transposed = t.transpose(vec![1, 0]);
assert_eq!(transposed.shape(), &[3, 2]);
Arithmetic Operations
All arithmetic operations are differentiable and automatically recorded on the autograd tape when the input tensors are tracked. This includes element-wise ops, matrix multiplication, and activation functions.
let a = Tensor::new(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2]);
let b = Tensor::new(vec![5.0, 6.0, 7.0, 8.0], vec![2, 2]);
// Element-wise operations: all are differentiable
// and recorded on the autograd tape.
let sum = a.add(&b);
let diff = a.sub(&b);
let product = a.mul(&b); // element-wise multiply
// Matrix multiplication
let mm = a.matmul(&b);
// Activation functions live in the nn module
use rumus::nn;
let activated = nn::relu(&sum);
// Dropout (only active in training mode)
let dropped = a.dropout(0.5);
Data Access
Tensor data is protected by an RwLock. Call t.data() for shared read access or t.data_write() for exclusive write access. Guards are dropped automatically when they go out of scope.
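The locking behaviour described above matches the standard library's `std::sync::RwLock`. The sketch below uses only `std` and assumes RUMUS wraps its buffer in a comparable lock; the function name `read_then_write` is made up for this example.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Sums the buffer from `readers` threads at once, then doubles it in place.
/// Returns the per-thread sums and the buffer contents after the write.
fn read_then_write(values: Vec<f32>, readers: usize) -> (Vec<f32>, Vec<f32>) {
    let shared = Arc::new(RwLock::new(values));

    // Shared read access: any number of read guards may coexist.
    let handles: Vec<_> = (0..readers)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                let guard = shared.read().unwrap();
                guard.iter().sum::<f32>()
            })
        })
        .collect();
    let sums: Vec<f32> = handles.into_iter().map(|h| h.join().unwrap()).collect();

    // Exclusive write access: the guard is released at the end of this block.
    {
        let mut guard = shared.write().unwrap();
        for v in guard.iter_mut() {
            *v *= 2.0;
        }
    }

    let result = shared.read().unwrap().clone();
    (sums, result)
}
```

One practical consequence with `std::sync::RwLock`: requesting the write lock while the same thread still holds a read guard can deadlock or panic, so keep guard scopes short.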
let t = Tensor::new(vec![1.0, 2.0, 3.0], vec![3]);
// Read access returns an RwLockReadGuard.
// Multiple readers can hold this simultaneously.
{
    let data = t.data();
    println!("first element: {}", data[0]);
} // guard is dropped here
// Write access returns an RwLockWriteGuard.
// Exclusive: blocks all other readers and writers.
{
    let mut data = t.data_write();
    data[0] = 42.0;
}
Storage Model
Under the hood, tensor storage is partitioned across CPU and GPU memory. The runtime tracks where the canonical copy lives and lazily synchronizes when needed. This is transparent to user code: you work with the same Tensor type regardless of device placement.
// Tensor storage is partitioned into three variants:
//
//   Cpu(Vec<f32>)             data lives on the CPU only
//   Gpu(wgpu::Buffer)         data lives on the GPU only
//   Both { cpu, gpu, dirty }  data exists in both locations.
//                             The `dirty` flag tracks which copy is stale.
//
// View operations (reshape, transpose) share the same
// storage and only modify the Layout descriptor:
//
//   Layout { shape, strides, offset }
//
// This means reshaping a 1 GB tensor is instantaneous
// and uses zero additional memory.
// Autograd tracking is stored per-tensor:
//
//   AutogradState::None
//     not tracked (constants, inference mode)
//   AutogradState::Tracked { grad_id, creator_op, is_leaf }
//     participates in the computation graph
N-Dimensional Broadcasting
RUMUS supports full N-dimensional broadcasting with PyTorch/NumPy semantics. Shapes are aligned from the right, and each dimension must be equal or 1. Size-1 dimensions are stretched using stride-0 indexing on the GPU, which means zero intermediate allocation — no temporary expanded tensor is ever materialized.
Stride-0 trick: When a dimension has size 1 and needs to broadcast, the GPU kernel sets its stride to 0. The same element is read for every index along that axis, avoiding any memory copy or expansion.
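The two rules (right alignment, then equal-or-1) and the stride-0 trick can be sketched on the CPU in a few lines. This is an illustration of the general NumPy-style technique, not RUMUS's GPU kernel; `broadcast_shape` and `broadcast_strides` are hypothetical helper names.

```rust
/// Computes the broadcast output shape, or None if the shapes are incompatible.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let rank = a.len().max(b.len());
    let mut out = vec![0; rank];
    for i in 0..rank {
        // Align from the right; missing leading dimensions count as 1.
        let da = if i < rank - a.len() { 1 } else { a[i - (rank - a.len())] };
        let db = if i < rank - b.len() { 1 } else { b[i - (rank - b.len())] };
        out[i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None,
        };
    }
    Some(out)
}

/// Strides for reading a row-major tensor of `shape` as if it had shape `out`.
/// Axes that are stretched by broadcasting get stride 0.
fn broadcast_strides(shape: &[usize], out: &[usize]) -> Vec<usize> {
    let pad = out.len() - shape.len();
    let mut strides = vec![0; out.len()];
    let mut running = 1;
    for i in (0..shape.len()).rev() {
        strides[pad + i] = if shape[i] == out[pad + i] { running } else { 0 };
        running *= shape[i];
    }
    strides
}
```

For [4,1,3] broadcast against [1,5,3], the output shape is [4,5,3] and the read strides are [3,0,1] and [0,3,1] respectively: each zero marks an axis along which the same element is reused.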
"token-comment">// N-dimensional broadcasting follows PyTorch/NumPy rules:
"token-comment">// 1. Shapes are aligned from the right
"token-comment">// 2. Each dimension must be equal or 1
"token-comment">// 3. Size-1 dims are "stretched" via stride-0 indexing
"token-comment">//
"token-comment">// This means zero intermediate allocation — the GPU
"token-comment">// kernel reads the broadcast source with stride 0.
let a = Tensor::new(vec![1.0, 2.0, 3.0], vec![1, 3]);
let b = Tensor::new(
    vec![10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    vec![2, 3],
);
// broadcast_add: [1,3] + [2,3] → [2,3]
let sum = a.broadcast_add(&b);
// broadcast_sub and broadcast_mul work the same way
let diff = a.broadcast_sub(&b);
let prod = a.broadcast_mul(&b);
// Higher-rank example: [4,1,3] + [1,5,3] → [4,5,3]
// No temporary tensor is created: stride-0 on GPU.
Batched Matrix Multiply
The bmm method performs batched matrix multiplication over 3D tensors. Each slice along the batch dimension is an independent matrix multiply: [B,M,K] @ [B,K,N] → [B,M,N]. This is the core primitive for multi-head attention.
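As a reference for the semantics only, here is a naive CPU version of the [B,M,K] @ [B,K,N] contraction over flat row-major buffers. The free function `bmm` below is a hypothetical illustration and says nothing about how RUMUS implements or dispatches the real method.

```rust
/// Naive batched matmul over flat row-major buffers:
/// `a` is [b, m, k], `rhs` is [b, k, n], the result is [b, m, n].
fn bmm(a: &[f32], rhs: &[f32], b: usize, m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), b * m * k);
    assert_eq!(rhs.len(), b * k * n);
    let mut out = vec![0.0; b * m * n];
    for batch in 0..b {
        // Each batch slice is an independent [m, k] @ [k, n] product.
        for i in 0..m {
            for j in 0..n {
                let mut acc = 0.0;
                for p in 0..k {
                    acc += a[batch * m * k + i * k + p] * rhs[batch * k * n + p * n + j];
                }
                out[batch * m * n + i * n + j] = acc;
            }
        }
    }
    out
}
```

With B=2, M=1, K=2, N=1, inputs [1, 2, 3, 4] and [5, 6, 7, 8] give [17, 53]: the two batches never mix.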
"token-comment">// Batched matrix multiply: [B, M, K] @ [B, K, N] → [B, M, N]
"token-comment">// Each slice in the batch dimension is an independent matmul.
let a = Tensor::new(vec![/* ... */], vec![8, 4, 16]); // [B=8, M=4, K=16]
let b = Tensor::new(vec![/* ... */], vec![8, 16, 8]); // [B=8, K=16, N=8]
let out = a.bmm(&b); // → shape [8, 4, 8]
assert_eq!(out.shape(), &[8, 4, 8]);
Softmax
Row-wise softmax with Log-Sum-Exp numerical stability. The implementation subtracts the row maximum before exponentiating, preventing overflow for large logit values. Fully differentiable and recorded on the autograd tape.
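The max-subtraction step is easy to check on the CPU. The sketch below is a plain-Rust reference for a flat row-major [rows, cols] buffer; `softmax_rows` is a hypothetical helper, not the RUMUS method.

```rust
/// Row-wise softmax over a flat row-major [rows, cols] buffer.
fn softmax_rows(data: &[f32], cols: usize) -> Vec<f32> {
    assert!(cols > 0 && data.len() % cols == 0);
    let mut out = Vec::with_capacity(data.len());
    for row in data.chunks(cols) {
        // Subtracting the row maximum keeps every exponent <= 0,
        // so exp() never overflows.
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = row.iter().map(|&x| (x - max).exp()).collect();
        let sum: f32 = exps.iter().sum();
        out.extend(exps.iter().map(|e| e / sum));
    }
    out
}
```

A row such as [1000.0, 1000.0] comes out as [0.5, 0.5]. Without the shift, `exp(1000.0)` overflows to infinity in f32 and the division yields NaN.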
"token-comment">// Row-wise softmax with Log-Sum-Exp numerical stability.
"token-comment">// For each row r:
"token-comment">// max_r = max(row)
"token-comment">// shifted = row - max_r
"token-comment">// out = exp(shifted) / sum(exp(shifted))
"token-comment">//
"token-comment">// This prevents overflow for large logits.
let logits = Tensor::new(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2]);
let probs = logits.softmax();
// Each row sums to 1.0
Activation Functions
In addition to nn::relu, RUMUS now provides activation functions as direct tensor methods. All are differentiable with their own BackwardOp variants for efficient gradient computation.
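For reference, the scalar formulas behind three of these activations can be written in plain Rust. The GELU here uses the common tanh approximation because the standard library has no `erf`; whether RUMUS uses the exact form or this approximation is not stated, so treat `gelu_tanh` as an assumption.

```rust
/// Logistic sigmoid: 1 / (1 + exp(-x)).
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Leaky ReLU: identity for positive inputs, slope `alpha` otherwise.
fn leaky_relu(x: f32, alpha: f32) -> f32 {
    if x > 0.0 { x } else { alpha * x }
}

/// GELU via the tanh approximation of x * Phi(x).
fn gelu_tanh(x: f32) -> f32 {
    let c = (2.0 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}
```

Applying any of these element-wise over a buffer, for example `data.iter().map(|&x| sigmoid(x))`, gives the forward pass; the backward pass is a separate concern not covered here.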
"token-comment">// New activation functions are available as tensor methods.
"token-comment">// All are differentiable and recorded on the autograd tape.
let x = Tensor::new(vec![-1.0, 0.0, 0.5, 2.0], vec![2, 2]);
// Sigmoid: 1 / (1 + exp(-x))
let s = x.sigmoid();
// Tanh activation
let t = x.tanh_act();
// GELU: x * Phi(x), the Gaussian Error Linear Unit
// (Phi is the standard normal CDF)
let g = x.gelu();
// Leaky ReLU: x for positive inputs, alpha * x otherwise
let lr = x.leaky_relu(0.01);
Tracked View Operations
Standard view ops like reshape and transpose are zero-copy but do not participate in autograd. The tracked variants below are recorded on the tape so gradients flow through them correctly during the backward pass.
// Tracked view ops participate in autograd just like
// data-moving ops: gradients flow back through them.
// transpose_tracked: swap two dimensions, recorded on tape
let t = Tensor::new(vec![/* ... */], vec![4, 8]);
let t_t = t.transpose_tracked(0, 1); // shape [8, 4]
// batched_transpose: swap last two dims of a 3D tensor
// Useful for attention: [B, S, D] → [B, D, S]
let batched = Tensor::new(vec![/* ... */], vec![2, 4, 8]);
let bt = batched.batched_transpose(); // shape [2, 8, 4]
// contiguous_tracked: materialize a contiguous copy
// when stride tricks are insufficient (e.g. before bmm)
let contig = bt.contiguous_tracked();
Layer and Batch Normalization
Normalization operations are available as tensor methods. Layer normalization normalizes over the last dimension (used in transformers), while batch normalization normalizes per channel over the batch and spatial dimensions (used in CNNs).
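A CPU reference for layer normalization over the last dimension may help pin down what the method computes. This sketch assumes the conventional definition (biased variance, `eps` added inside the square root, then an element-wise scale and shift); the free function `layer_norm` is hypothetical.

```rust
/// Layer normalization over the last dimension of a flat row-major
/// [rows, dim] buffer, where dim = weight.len().
fn layer_norm(x: &[f32], weight: &[f32], bias: &[f32], eps: f32) -> Vec<f32> {
    let dim = weight.len();
    assert!(dim > 0 && x.len() % dim == 0 && bias.len() == dim);
    let mut out = Vec::with_capacity(x.len());
    for row in x.chunks(dim) {
        let mean = row.iter().sum::<f32>() / dim as f32;
        // Biased variance (divide by dim), the usual convention for layer norm.
        let var = row.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / dim as f32;
        let inv_std = 1.0 / (var + eps).sqrt();
        for (i, v) in row.iter().enumerate() {
            out.push((v - mean) * inv_std * weight[i] + bias[i]);
        }
    }
    out
}
```

With unit weights and zero bias, each output row has mean close to 0 and variance close to 1; a constant row maps to all zeros because `eps` keeps the division finite.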
"token-comment">// Layer normalization as a tensor method.
"token-comment">// Normalizes over the last dimension.
let x = Tensor::new(vec![/* ... */], vec![2, 4]);
let weight = Tensor::new(vec![1.0; 4], vec![4]);
let bias = Tensor::new(vec![0.0; 4], vec![4]);
let normed = x.layer_norm(&weight, &bias, 1e-5);
// Batch normalization for 2D feature maps.
// Operates over [N, C, H, W], normalizing per channel.
let feature_map = Tensor::new(vec![/* ... */], vec![8, 16, 32, 32]);
// Per-channel parameters and running statistics (C = 16).
// The [C] shapes used here are an assumption for illustration.
let gamma = Tensor::new(vec![1.0; 16], vec![16]);
let beta = Tensor::new(vec![0.0; 16], vec![16]);
let running_mean = Tensor::new(vec![0.0; 16], vec![16]);
let running_var = Tensor::new(vec![1.0; 16], vec![16]);
let bn_out = feature_map.batch_norm_2d(
    &gamma, &beta, &running_mean, &running_var, 1e-5, true,
);
Embedding Lookup
The embedding_forward method gathers rows from a weight matrix based on integer indices stored in the tensor. This is the standard embedding table lookup used for token embeddings in language models.
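The gather itself is a row copy per index. The sketch below shows it over flat row-major buffers, with indices carried as f32 as the text describes; `embedding_lookup` is a hypothetical helper and the bounds check is an assumption about reasonable behaviour, not documented RUMUS behaviour.

```rust
/// Gathers rows of a flat row-major [vocab, dim] table.
/// Indices are f32 values holding whole numbers.
fn embedding_lookup(indices: &[f32], table: &[f32], dim: usize) -> Vec<f32> {
    assert!(dim > 0 && table.len() % dim == 0);
    let vocab = table.len() / dim;
    let mut out = Vec::with_capacity(indices.len() * dim);
    for &idx in indices {
        let row = idx as usize;
        assert!(row < vocab, "index {} out of range for vocab size {}", row, vocab);
        out.extend_from_slice(&table[row * dim..(row + 1) * dim]);
    }
    out
}
```

The output has one `dim`-length row per index, which is why indices of shape [2, 2] against a [10, 64] table yield [2, 2, 64]. Note that f32 represents integers exactly only up to 2^24 (16,777,216), which bounds the vocabulary size that f32 indices can address.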
"token-comment">// Embedding lookup: index into a weight matrix.
"token-comment">// Input tensor holds integer indices(stored as type">f32).
let indices = Tensor::new(vec![0.0, 3.0, 1.0, 2.0], vec![2, 2]);
// Weight matrix: [vocab_size, embed_dim]
let embed_weights = Tensor::new(vec![/* ... */], vec![10, 64]);
// Forward: gather rows from embed_weights
let embedded = indices.embedding_forward(&embed_weights);
// Output shape: [2, 2, 64]
Adaptive Average Pooling
Adaptive average pooling reduces spatial dimensions to a fixed output size regardless of input resolution. This is commonly used as the bridge between convolutional feature maps and fully connected layers, or as a global average pooling layer when the output size is 1x1.
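What makes the pooling "adaptive" is that window boundaries are derived from the input and output sizes instead of a fixed kernel. The sketch below uses the PyTorch convention (start = floor(i * in / out), end = ceil((i + 1) * in / out)); that RUMUS follows the same convention is an assumption, and both function names are hypothetical.

```rust
/// Half-open input window [start, end) that feeds output index `i`.
fn window(i: usize, input: usize, output: usize) -> (usize, usize) {
    let start = i * input / output;
    let end = ((i + 1) * input + output - 1) / output; // ceiling division
    (start, end)
}

/// Adaptive average pooling of one row-major [h, w] plane to [out_h, out_w].
fn adaptive_avg_pool_plane(x: &[f32], h: usize, w: usize, out_h: usize, out_w: usize) -> Vec<f32> {
    assert_eq!(x.len(), h * w);
    let mut out = Vec::with_capacity(out_h * out_w);
    for oy in 0..out_h {
        let (y0, y1) = window(oy, h, out_h);
        for ox in 0..out_w {
            let (x0, x1) = window(ox, w, out_w);
            let mut sum = 0.0;
            for y in y0..y1 {
                for xx in x0..x1 {
                    sum += x[y * w + xx];
                }
            }
            out.push(sum / ((y1 - y0) * (x1 - x0)) as f32);
        }
    }
    out
}
```

For 224 inputs and 7 outputs every window is exactly 32 wide. When the sizes do not divide evenly (for example 5 to 3) the windows become (0,2), (1,4), (3,5) and overlap slightly.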
"token-comment">// Adaptive average pooling: reduces spatial dims to
"token-comment">// a fixed output size regardless of input resolution.
"token-comment">// Operates on [N, C, H, W] tensors.
let x = Tensor::new(vec![/* ... */], vec![1, 3, 224, 224]);
// Pool down to 7x7
let pooled = x.adaptive_avg_pool2d(7, 7);
assert_eq!(pooled.shape(), &[1, 3, 7, 7]);
// Pool down to 1x1 (global average pooling)
let gap = x.adaptive_avg_pool2d(1, 1);
assert_eq!(gap.shape(), &[1, 3, 1, 1]);
DType and Mixed Precision
RUMUS supports multiple data types via the DType enum. Currently F32 and F16 are available. Use tensor.to_dtype(DType::F16) to cast — this is zero-copy when the dtype already matches, and dispatches a GPU cast kernel otherwise.
CPU inspection: F16 tensors automatically cast to F32 when you call .data(), so you always get readable f32 values on the CPU side. GPU buffer alignment for F16 elements is enforced to a 4-byte boundary to satisfy WebGPU requirements.
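The 4-byte alignment matters whenever an F16 tensor has an odd number of elements, because its raw byte length is then not a multiple of 4. The sketch below shows one plausible rounding rule; `f16_gpu_buf_size` is a hypothetical stand-in, and the exact formula used by `DType::F16.gpu_buf_size` is an assumption.

```rust
/// Byte length of `numel` 2-byte elements, rounded up to a multiple of 4.
/// Hypothetical stand-in for DType::F16.gpu_buf_size(numel).
fn f16_gpu_buf_size(numel: usize) -> usize {
    let bytes = numel * 2;
    // Adding 3 and clearing the two low bits rounds up to the next multiple of 4.
    (bytes + 3) & !3
}
```

Under this rule 1024 elements need 2048 bytes with no padding, while 3 elements need 8 bytes (6 bytes of data plus 2 of padding).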
use rumus::tensor::{Tensor, DType};
// DType enum: currently F32 (4 bytes) and F16 (2 bytes)
//
// DType::F32.byte_size() → 4
// DType::F16.byte_size() → 2
// DType::F16.gpu_buf_size(1024) → aligned buffer size for 1024 F16 elements
// Cast a tensor to a different dtype
let t = Tensor::new(vec![1.0, 2.0, 3.0], vec![3]);
assert_eq!(t.dtype(), DType::F32);
// to_dtype: zero-copy if same dtype, GPU cast kernel otherwise
let t_f16 = t.to_dtype(DType::F16);
assert_eq!(t_f16.dtype(), DType::F16);
// Casting back is also a single call
let t_f32 = t_f16.to_dtype(DType::F32);
// Query current dtype at any time
assert_eq!(t_f32.dtype(), DType::F32);
// F16 tensors auto-cast to F32 when calling .data()
// for CPU inspection; no manual conversion needed.
let f16_tensor = t.to_dtype(DType::F16);
let cpu_data = f16_tensor.data(); // returns f32 values
// GPU buffer alignment: F16 elements are aligned to
// a 4-byte boundary to satisfy WebGPU requirements.
// DType::F16.gpu_buf_size(numel) accounts for this.
INT8 Quantization
RUMUS supports symmetric block quantization via DType::Q8 { block_size }. Each block stores a 4-byte header (f16 scale) followed by packed i8 data. This reduces model weight memory by ~4x with minimal accuracy loss, enabling faster inference on memory-constrained devices.
Inference only: Quantization and dequantization are untracked operations — they are not recorded on the autograd tape and do not participate in gradient computation. Use them for inference, not training.
- tensor.quantize(block_size): F32/F16 to Q8, column-major repacking for cache hits. The block_size must be divisible by 4.
- tensor.dequantize(): Q8 to F32, untracked.
- tensor.matmul(&q8_weights): mixed-precision matmul in which scalar activations x Q8 weights produce scalar output, with on-the-fly dequantization in GPU registers.
- Q8 tensors auto-dequantize when calling .data() for CPU inspection.
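Symmetric block quantization can be sketched in a few lines of plain Rust. This is the generic technique (one scale per block, values mapped to [-127, 127]), not RUMUS's packed layout: the scale is kept as f32 instead of f16 to stay std-only, no column-major repacking is done, and `Q8Block`, `quantize_q8` and `dequantize_q8` are hypothetical names.

```rust
/// One quantized block: a scale plus up to `block_size` signed 8-bit values.
struct Q8Block {
    scale: f32, // stored as f16 in the format described above; f32 here for simplicity
    values: Vec<i8>,
}

fn quantize_q8(data: &[f32], block_size: usize) -> Vec<Q8Block> {
    assert!(block_size > 0 && block_size % 4 == 0, "block_size must be divisible by 4");
    data.chunks(block_size)
        .map(|block| {
            let max_abs = block.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            // Symmetric: zero maps to zero and +/-max_abs maps to +/-127.
            let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
            let values: Vec<i8> = block
                .iter()
                .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
                .collect();
            Q8Block { scale, values }
        })
        .collect()
}

fn dequantize_q8(blocks: &[Q8Block]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.values.iter().map(move |&q| q as f32 * b.scale))
        .collect()
}
```

Each value costs one byte plus a small per-block share of the scale, which is where the roughly 4x saving over f32 comes from. The round trip is lossy: every element can be off by up to half a quantization step (scale / 2).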
"token-comment">// Quantize model weights for inference
let q8_weight = weight.quantize(128);
// Mixed-precision forward pass
let output = activation.matmul(&q8_weight);
// Inspect quantized tensor
let dequantized = q8_weight.dequantize();
INT4 Quantization
Beyond INT8, RUMUS supports INT4 group-wise quantization with asymmetric scaling (AWQ/GPTQ) via the rumus-vision crate. This further reduces weight memory by ~2x compared to INT8, enabling larger models to fit in GPU memory for inference.
- Packed format: 2 INT4 values per byte, with group-wise scales and zero-points stored alongside the packed data
- Group-wise scaling: weights are divided into groups of group_size elements, each with its own scale and zero-point for asymmetric quantization
- Alignment: the K-dimension is padded up to the next multiple of group_size for correct WGSL dispatch alignment
- Available via the rumus-vision crate using QuantizedTensor::from_f32() and QLinear::from_linear()
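The points above can be illustrated with a generic asymmetric 4-bit scheme: one scale and zero-point per group, two codes per byte, and K rounded up to a group boundary. This sketch is not the rumus-vision implementation; the nibble order, the choice to always include 0.0 in the quantized range, and all function names are assumptions made for the example.

```rust
/// Rounds `k` up to the next multiple of `group_size`.
fn pad_to_group(k: usize, group_size: usize) -> usize {
    (k + group_size - 1) / group_size * group_size
}

/// Quantizes one group to unsigned 4-bit codes.
/// Returns (packed bytes, scale, zero_point). Requires an even group length.
fn quantize_group_int4(group: &[f32]) -> (Vec<u8>, f32, u8) {
    assert!(group.len() % 2 == 0);
    // Include 0.0 in the range so that zero is exactly representable.
    let min = group.iter().cloned().fold(f32::INFINITY, f32::min).min(0.0);
    let max = group.iter().cloned().fold(f32::NEG_INFINITY, f32::max).max(0.0);
    let scale = if max == min { 1.0 } else { (max - min) / 15.0 };
    let zero_point = (-min / scale).round().clamp(0.0, 15.0) as u8;
    let code = |v: f32| ((v / scale).round() + zero_point as f32).clamp(0.0, 15.0) as u8;
    // Low nibble holds the even element, high nibble the odd one.
    let packed: Vec<u8> = group.chunks(2).map(|p| code(p[0]) | (code(p[1]) << 4)).collect();
    (packed, scale, zero_point)
}

fn dequantize_group_int4(packed: &[u8], scale: f32, zero_point: u8) -> Vec<f32> {
    packed
        .iter()
        .flat_map(|&byte| [byte & 0x0F, byte >> 4])
        .map(|q| (q as f32 - zero_point as f32) * scale)
        .collect()
}
```

For the group [-1.0, 0.0, 1.0, 2.0] the scale is 0.2 and the zero-point is 5, giving codes 0, 5, 10, 15 packed into the two bytes 0x50 and 0xFA.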
Memory-Mapped Data (.rrec format)
RUMUS provides a custom binary data format for high-throughput dataset access. RecordWriter creates .rrec files consisting of a 64-byte header, sequential data blocks, and a trailing index. RecordDataset opens these files via memmap2 for O(1) random access and implements the Dataset trait.
Concurrent reads: Memory-mapped access is lock-free — multiple threads can read different records simultaneously, bypassing filesystem bottlenecks that plague traditional file-based datasets.
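To show why a trailing index gives O(1) random access, the sketch below builds a toy image in memory with the same three parts: a 64-byte header, data blocks, and an index at the end. The header fields (record count at bytes 0..8, index offset at bytes 8..16) and the 16-byte index entries are invented for this illustration; the real .rrec layout is not specified in this section.

```rust
use std::convert::TryInto;

const HEADER_LEN: usize = 64;

/// Builds an in-memory image: 64-byte header, data blocks, trailing index.
/// Hypothetical header fields: record count at 0..8, index offset at 8..16.
fn build_image(records: &[&[u8]]) -> Vec<u8> {
    let mut image = vec![0u8; HEADER_LEN];
    let mut index: Vec<(u64, u64)> = Vec::new(); // (offset, length) per record
    for rec in records {
        index.push((image.len() as u64, rec.len() as u64));
        image.extend_from_slice(rec);
    }
    let index_offset = image.len() as u64;
    for (off, len) in &index {
        image.extend_from_slice(&off.to_le_bytes());
        image.extend_from_slice(&len.to_le_bytes());
    }
    image[0..8].copy_from_slice(&(records.len() as u64).to_le_bytes());
    image[8..16].copy_from_slice(&index_offset.to_le_bytes());
    image
}

fn read_u64(bytes: &[u8], at: usize) -> u64 {
    u64::from_le_bytes(bytes[at..at + 8].try_into().unwrap())
}

/// O(1) lookup: read one index entry, then slice the record bytes directly.
fn get_record(image: &[u8], i: usize) -> Option<&[u8]> {
    let count = read_u64(image, 0) as usize;
    if i >= count {
        return None;
    }
    let entry = read_u64(image, 8) as usize + i * 16;
    let off = read_u64(image, entry) as usize;
    let len = read_u64(image, entry + 8) as usize;
    Some(&image[off..off + len])
}
```

Because `get_record` only borrows the byte slice immutably, any number of threads can call it on the same mapping at once, which is the property the lock-free claim above relies on.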
use rumus::data::{RecordWriter, RecordDataset, Dataset};
// Write a .rrec file: header (64B) + data blocks + trailing index
let mut writer = RecordWriter::create("train.rrec")?;
for sample in &samples {
    writer.write_record(sample)?;
}
writer.finish()?; // flushes trailing index
// Open via memmap2 for O(1) random access
let dataset = RecordDataset::open("train.rrec")?;
let sample = dataset.get(42)?; // seeks directly via index
// RecordDataset implements the Dataset trait:
//   fn len(&self) -> usize;
//   fn get(&self, index: usize) -> Result<Record>;
//
// Lock-free concurrent reads: multiple threads can
// read different records simultaneously, bypassing
// filesystem bottlenecks.