pu-rs.org – Processing Unit Ranking System
The SPECfp for AI accelerators.
FLOPS don’t tell the full story. A chip rated at 1000 TFLOPS means nothing if your softmax kernel only achieves 5% utilization. pu-rs.org measures what matters: actual kernel execution time on real hardware, for the operations that AI workloads actually run.
Why this exists
| What we measure | What others report |
|---|---|
| Softmax latency at (64, 4096) f16 | Peak TFLOPS |
| LayerNorm throughput per watt | Memory bandwidth (theoretical) |
| MatMul efficiency vs roofline | Marketing benchmarks |
| Cost per real GOPS | Cloud $/hour (opaque) |
Scope
We benchmark the kernel primitives that compose every AI model:
| Category | Kernels |
|---|---|
| Activation | Softmax, GELU, SiLU |
| Normalization | LayerNorm, RMSNorm |
| Linear Algebra | GEMM, batched MatMul |
| Attention | Scaled Dot-Product Attention |
| Quantization | VQ-Quantize, INT8 dequant |
| Convolution | Conv1D, dilated Conv1D |
| Reduction | Scatter-add, L1-smooth loss |
Devices covered
| Type | Vendors |
|---|---|
| GPU | NVIDIA (A100, H100, H200, B200), AMD (MI300X), Apple (M2/M4 Max), Cambricon |
| TPU | Google (v5e, v6e Trillium) |
| NPU | Huawei Ascend (910B, 910C), AWS Trainium2, Intel Gaudi 3 |
How it works
- Run standardized benchmark scripts on your hardware
- Submit CSV results via pull request
- CI validates format and sanity checks
- Leaderboard updates automatically with per-kernel rankings
All results tagged with git SHA, driver version, toolchain, and number of runs. Median latency reported. Full methodology.
End-to-end complement
Per-kernel latency is only half the story — a chip can win on softmax and still lose on a real model. The DeepSeek decode page reports end-to-end throughput across five accelerators (Ascend 910B2, TPU v2-8, Apple M2 Max, NVIDIA T4, AWS Trainium1) from the same 13-kernel Rust source emitted through the ascend-rs MLIR backends.
Built with ascend-rs kernel infrastructure. Data updated weekly.
Leaderboard
| # | Vendor | Device | Type | Kernel | Dtype | Shape | Latency (us) | GOPS | GOPS/$ | GOPS/W | Verified |
|---|
DeepSeek Decode Throughput (Cross-Vendor)
Per-kernel latency (the rest of this site) is the microbenchmark view. This page reports the complementary end-to-end decode throughput for a full DeepSeek-R1-Distill-Qwen-1.5B decode step, measured on five production accelerators from the same 13-kernel Rust source emitted through ascend-rs’s per-vendor MLIR backends.
Headline
| Rank | Device | Backend | Emitted language | Decode tok/s |
|---|---|---|---|---|
| 1 | Huawei Ascend 910B2 | mlir_to_cpp+mlir_to_pto | AscendC C++ + PTO-MLIR | 168.9 |
| 2 | Google TPU v2-8 (Colab) | mlir_to_tpu | Pallas | 162.9 |
| 3 | Apple M2 Max | mlir_to_msl | Metal | 91.7 |
| 4 | NVIDIA T4 (Colab) | mlir_to_gpu | CUDA | 53.7 |
| 5 | AWS Trainium1 (trn1.2xlarge) | mlir_to_nki | NKI Python | 12.2 |
| – | CPU reference | – | plain Rust | 3.7 |
All five numbers come from identical Rust kernel sources. The MLIR → vendor backend is the only thing that changes between rows. 168.9 tok/s on 910B2 is 2.47× the aclnn-only baseline and 45.6× the CPU reference.
Why this complements the kernel leaderboard
The kernel leaderboard tells you which chip runs a given softmax or GEMM fastest in isolation. Decode tok/s tells you what a real inference workload actually achieves once those kernels are composed with host-side launch overhead, KV cache traffic, and HBM pressure. A chip can win per-kernel and still lose on decode (the Trainium row is the clearest example — strong per-op latency, 9.5% bandwidth utilisation end-to-end).
Notes per device
- Ascend 910B2 — the
+ptohalf of the joint path contributes the four decode matmul shapes (1.75×–2.98× vs aclnn). RMSNorm stays on CPU; every other op is on-NPU. See the main ascend-rs blog ch10 for the per-kernel breakdown. - TPU v2-8 — Colab-visible Pallas on a 4-chip v2 pod; the emitted kernel
set is
rms_norm,matvec_f16, and an attention fusion. - Apple M2 Max — emitted Metal beats Apple’s hand-tuned MLX on decode at
this model size. Measured via
deepseek_metal. - NVIDIA T4 — Colab Tesla T4; same three kernels as the TPU row, emitted in CUDA. 53.7 tok/s is below-roofline (T4 is HBM-bandwidth-starved for this model shape).
- AWS Trainium1 —
trn1.2xlarge. Six emitted NKI kernels (rms_norm, threematvec_f16variants,gate_up_silu). Traced viatorch_neuronx.tracein two halves (eager single-NEFF runs at 2.5 tok/s — 5× slower — because the single compile unit can’t pipeline across the whole decode path). Trace time: 461 s; wall time: 5.23 s for 64 decode steps.
Reproduction
Each number is reproducible with the commands documented in the per-device
sections of ascend-rs ch10.
The source kernels live at
crates/ascend_std/src/tile.rs
in the public repo; the per-vendor emitters live in
crates/rustc_codegen_mlir/src/mlir_to_*.rs.
The per-kernel leaderboard remains the authoritative view for isolated-kernel efficiency. Decode throughput is reported here as the complementary end-to-end metric.
Cost Effectiveness
The most important metric for deployment decisions: how much real performance do you get per dollar and per watt?
| # | Device | Kernel | Latency (us) | MSRP ($) | TDP (W) | GOPS/$ | GOPS/W |
|---|
Methodology
Measurement protocol
- Warmup: 50 iterations discarded
- Measurement: 500 iterations, median latency reported
- Amortization: Dispatch overhead amortized by batching 500 kernel launches into one command buffer where supported
- Isolation: Benchmarks run on idle systems, no background GPU workloads
What we measure
Kernel-only time: the GPU/NPU execution time for a single kernel dispatch, excluding:
- Host-to-device data transfer (data assumed resident)
- Command buffer creation overhead (amortized)
- Python/framework overhead
This isolates the hardware+compiler efficiency from the software stack.
Reporting
| Metric | Definition |
|---|---|
| Latency (us) | Median kernel execution time in microseconds |
| GOPS | Throughput: operations / latency |
| GOPS/$ | Throughput / device MSRP in USD |
| GOPS/W | Throughput / TDP in watts |
Standardized configurations
Each kernel is benchmarked at these canonical shapes:
| Kernel | Shapes | Dtypes |
|---|---|---|
| Softmax | (1,1024), (64,1024), (64,4096) | f32, f16 |
| LayerNorm | (1,768), (64,768), (1024,768) | f32, f16 |
| GEMM | (1024,1024,1024), (4096,4096,4096) | f32, f16, bf16 |
| Attention | (1,32,128,128), (32,32,2048,128) | f32, f16 |
How to submit
See Submit Results.
GEMM (MatMul)
Category: Linear Algebra | Complexity: O(M·K·N) | Compute: Cube/Tensor Core bound
Algorithm
Dense matrix multiplication: C[M×N] = A[M×K] × B[K×N].
The fundamental ML primitive — dominates runtime in transformers (linear projections, attention scores, FFN layers). Performance depends on tiling strategy, memory hierarchy utilization, and hardware matrix units (cube engines, tensor cores).
ascend-rs Kernel Source
Matrix multiplication in ascend-rs uses the tile API, which compiles to hardware-specific matmul units (cube engine on Ascend, tensor cores on CUDA, etc.).
Safe entry form — kernel body is pure safe Rust, shape committed at the type level:
#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};
#[ascend_std::aiv_kernel]
pub fn tile_matmul(
a: GmView<'_, M, K, f32>,
b: GmView<'_, K, N, f32>,
output: GmViewMut<'_, M, N, f32>,
) {
let at = tile_load_view_f32(&a);
let bt = tile_load_view_f32(&b);
let c = safe::tile_matmul_f32(at, bt);
tile_store_view_f32(&output, c);
}
}
No unsafe blocks in the body. A mismatched K between the two operands is a compile-time error. The #[aiv_kernel] attribute rewrites the emitted signature to raw *const f32 / *mut f32 so the launcher toolchain sees an unchanged C ABI — #[repr(transparent)] makes the rewrite free at the LLVM IR level.
This compiles via rustc_codegen_mlir → MLIR → target code on all 9 backends:
- Ascend AIV: PTO-MLIR
pto.tmatmul→ cube engine (320 TFLOPS f16 on 910B) - CUDA:
__shared__tiled GEMM with__syncthreads() - Apple Metal / Vulkan SPIR-V: compute shader with shared-memory tiling
- AWS NKI (Trainium):
nki.isa.nc_matmul - AMD AIE: AIE2P cascade matmul
- Cambricon BANG:
__bang_matmulon MLU tensor units - Intel Gaudi: HPU matmul intrinsic
- Google TPU: XLA
dot_generalvia OpenXLA
For benchmarking, vendor-optimized libraries are used: aclnnMatmul (Ascend), cuBLAS (CUDA), MPSMatrixMultiplication (Metal).
Benchmark configurations
| Shape (A × B) | FLOPs | Notes |
|---|---|---|
| [1024, 1024] × [1024, 1024] | 2.1 G | Small square, tests dispatch overhead |
| [4096, 4096] × [4096, 4096] | 137 G | Standard benchmark, bandwidth→compute transition |
| [8192, 8192] × [8192, 8192] | 1.1 T | Large square, saturates compute units |
| [16384, 16384] × [16384, 16384] | 8.8 T | Full hardware saturation |
| [1024, 4096] × [4096, 1024] | 8.6 G | Rectangular, typical FFN down-projection |
| [4096, 1024] × [1024, 4096] | 34.4 G | Rectangular, typical FFN up-projection |
| [2048, 8192] × [8192, 2048] | 67.1 G | Transformer-scale attention projection |
All benchmarks use f16 input with f16 output (or f32 accumulation where supported).
Results
| Device | Shape | Latency (μs) | TFLOPS | GOPS/W |
|---|---|---|---|---|
| Ascend 910B | [4096²]×[4096²] | 437 | 314.5 | 1014 |
| Ascend 910B | [8192²]×[8192²] | 3,614 | 304.2 | 981 |
| Ascend 910B | [16384²]×[16384²] | 27,467 | 320.2 | 1033 |
| Ascend 910B | [2048, 8192]×[8192, 2048] | 245 | 280.0 | 903 |
| Ascend 910B | [4096, 1024]×[1024, 4096] | 132 | 260.1 | 839 |
| Apple M2 Max | [4096²]×[4096²] | 17,374 | 7.9 | 53 |
| Apple M2 Max | [8192²]×[8192²] | 139,596 | 7.9 | 53 |
| Apple M2 Max | [2048, 8192]×[8192, 2048] | 8,972 | 7.7 | 51 |
| Apple M2 Max | [4096, 1024]×[1024, 4096] | 4,345 | 7.9 | 53 |
| Tesla T4 | [4096²]×[4096²] | 5,698 | 24.1 | 345 |
| Tesla T4 | [8192²]×[8192²] | 44,099 | 24.9 | 356 |
| Tesla T4 | [2048, 8192]×[8192, 2048] | 2,567 | 26.8 | 383 |
| Tesla T4 | [4096, 1024]×[1024, 4096] | 1,549 | 22.2 | 317 |
Peak: 320 TFLOPS (f16) on Ascend 910B — saturating the theoretical maximum. Tesla T4 peaks at 26.8 TFLOPS (f16) via cuBLAS (torch.matmul). Apple M2 Max peaks at 7.9 TFLOPS (f16) via MPSMatrixMultiplication.
See Leaderboard filtered to MatMul for the full filterable view.
Attention (Scaled Dot-Product)
Category: Attention | Complexity: O(B·H·S²·D) | Compute: Cube/Tensor Core bound
Algorithm
Scaled dot-product attention: Output = softmax(Q·K^T / √d) · V
The core transformer primitive — computes attention weights from queries and keys, applies softmax normalization, then produces a weighted sum of values. Dominates runtime in all transformer architectures (GPT, BERT, LLaMA, etc.).
Pipeline:
- Scores = Q × K^T — matmul (S×D) × (D×S) → (S×S)
- Scale by 1/√d — element-wise multiply
- Softmax along last axis — numerically stable (max → sub → exp → sum → div)
- Output = Weights × V — matmul (S×S) × (S×D) → (S×D)
ascend-rs Kernel Source
The attention pipeline in ascend-rs combines tile-API matmul with custom Rust kernels for scale and softmax:
#![allow(unused)]
fn main() {
use ascend_rs::prelude::*;
let scale = 1.0f32 / (d_k as f32).sqrt();
// Step 1: scores = Q × K^T (HGEMM via cube engine)
acl_blas_hgemm(TransN, TransT, TransN,
seq_len, seq_len, d_k,
&alpha, &d_q, d_k, &d_k_mat, d_k,
&beta, &mut d_scores, seq_len,
HighPrecision, &stream)?;
// Step 2: scores *= 1/√d_k (custom Rust kernel → NPU)
scale_kernel.launch(1, &stream, &mut [
d_scores.as_mut_ptr(), // in-place
d_scores.as_mut_ptr(), // output (same buffer)
d_n_scores.as_mut_ptr(),
d_scale.as_mut_ptr(),
])?;
// Step 3: weights = softmax(scores) (custom Rust kernel → NPU)
softmax_kernel.launch(1, &stream, &mut [
d_scores.as_mut_ptr(),
d_weights.as_mut_ptr(),
d_row_len.as_mut_ptr(),
d_num_rows.as_mut_ptr(),
])?;
// Step 4: output = weights × V (HGEMM via cube engine)
acl_blas_hgemm(TransN, TransN, TransN,
seq_len, d_k, seq_len,
&alpha, &d_weights, seq_len, &d_v, d_k,
&beta, &mut d_output, d_k,
HighPrecision, &stream)?;
}
The scale and softmax kernels are written in Rust and compiled via rustc_codegen_mlir → MLIR → backend code. The GEMMs use vendor-optimized libraries (aclnnMatmul, cuBLAS, MPSMatrixMultiplication).
Backend status for the fused safe::tile_attention_f32 tile op: Ascend AIV, Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V (5/9). CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO — on those backends the pipeline still runs as separate matmul + softmax + matmul dispatches using the individually-lowered tile ops.
Benchmark configurations
| Shape (B, H, S, D) | FLOPs | Notes |
|---|---|---|
| (1, 1, 128, 64) | 4.2 M | Small baseline, dispatch overhead test |
| (1, 1, 512, 64) | 67 M | Medium sequence |
| (1, 1, 1024, 64) | 268 M | GPT-2 scale |
| (1, 1, 2048, 64) | 1.1 G | Long context |
| (1, 1, 4096, 64) | 4.3 G | Very long context (quadratic scaling) |
| (1, 8, 512, 64) | 537 M | 8-head, GPT-2 like |
| (1, 12, 512, 64) | 805 M | 12-head, BERT-base |
| (1, 32, 512, 64) | 2.1 G | 32-head, LLaMA-7B |
| (1, 32, 1024, 128) | 17.2 G | 32-head, LLaMA-2-7B |
| (1, 32, 2048, 128) | 68.7 G | 32-head, long context |
All benchmarks use f16 input with f16 output. FLOPs ≈ 4·B·H·S²·D (two matmuls dominate).
Results
| Device | Shape (B,H,S,D) | Latency (μs) | TFLOPS | Notes |
|---|---|---|---|---|
| Ascend 910B | (1,32,1024,128) | 310 | 55.4 | aclnnMatmul+Softmax, manual pipeline |
| Ascend 910B | (1,32,2048,128) | 1,459 | 47.1 | Memory-bound at long context |
| Ascend 910B | (1,1,4096,64) | 149 | 28.9 | Single-head, large S |
| Ascend 910B | (1,32,512,64) | 105 | 20.4 | 32-head, short context |
| Tesla T4 | (1,32,1024,128) | 2,609 | 6.6 | F.scaled_dot_product_attention |
| Tesla T4 | (1,32,2048,128) | 5,067 | 13.6 | Flash attention backend |
| Tesla T4 | (1,1,4096,64) | 974 | 4.4 | Single-head |
| Tesla T4 | (1,32,512,64) | 427 | 5.0 | 32-head |
| Apple M2 Max | (1,32,1024,128) | 138,819 | 0.12 | MPS GEMM + CPU softmax |
| Apple M2 Max | (1,1,4096,64) | 60,647 | 0.07 | MPS GEMM + CPU softmax |
Peak: 55.4 TFLOPS (f16) on Ascend 910B. Tesla T4 peaks at 13.6 TFLOPS (f16) via PyTorch SDPA. Apple M2 Max peaks at 0.14 TFLOPS — bottlenecked by CPU softmax (no fused MPS attention).
See Leaderboard filtered to Attention for the full filterable view.
Softmax
Category: Activation | Complexity: O(N) per row | Memory: 2 passes over input
Algorithm
The online 2-pass softmax (Milakov & Gimelshein 2018):
Pass 1 (single traversal): Maintain running (max, sum) pair per thread. When a new maximum is found, rescale the accumulated sum:
sum_new = sum_old * exp(max_old - max_new) + exp(x - max_new)
Pass 2: Write exp(x - global_max) / global_sum per element.
This is 33% less memory traffic than the naive 3-pass algorithm (max, exp+sum, normalize).
ascend-rs Kernel Source
Softmax in ascend-rs uses the buffer API for element-wise backends and the tile API for matrix-oriented backends:
Scalar kernel (f32, benchmarked implementation):
#![allow(unused)]
fn main() {
#[ascend_std::aiv_kernel]
pub fn softmax(input: *const f32, output: *mut f32, len: *const u32) {
let n = *len as usize;
// Step 1: Find max for numerical stability
let mut max_val = *input;
let mut i = 1usize;
loop {
if i >= n { break; }
let val = *input.wrapping_add(i);
if val > max_val { max_val = val; }
i += 1;
}
// Step 2: exp(x - max) and accumulate sum
let mut sum: f32 = 0.0;
i = 0;
loop {
if i >= n { break; }
let exp_val = (*input.wrapping_add(i) - max_val).exp();
*output.wrapping_add(i) = exp_val;
sum += exp_val;
i += 1;
}
// Step 3: Normalize
i = 0;
loop {
if i >= n { break; }
*output.wrapping_add(i) = *output.wrapping_add(i) / sum;
i += 1;
}
}
}
Tile API — safe entry form (lowered by rustc_codegen_mlir to all 9 backends: Ascend AIV, CUDA, Apple Metal, Vulkan SPIR-V, AWS NKI, AMD AIE, Cambricon BANG, Intel Gaudi, Google TPU):
#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};
#[ascend_std::aiv_kernel]
pub fn tile_softmax(
input: GmView<'_, 1, 1024, f32>,
output: GmViewMut<'_, 1, 1024, f32>,
) {
let x = tile_load_view_f32(&input);
let y = safe::tile_softmax_f32(x);
tile_store_view_f32(&output, y);
}
}
The kernel body is pure safe Rust — no unsafe blocks. Shape (rows, cols, dtype) is committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. The #[aiv_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 so the launcher/compiler toolchain (bisheng / ACL / nvcc) sees the same C ABI. #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level — the two forms emit as literal symbol aliases.
Kernels compile via rustc_codegen_mlir → MLIR → target-specific code. Softmax is one of the four “hot path” tile ops (alongside matmul, rms-norm, silu) that is lowered on every backend currently targeted.
Benchmark configurations
| Shape | Elements | Bytes (f32) | Notes |
|---|---|---|---|
| (1, 1024) | 1K | 4 KB | L1-resident, tests dispatch overhead |
| (64, 1024) | 64K | 256 KB | L2-resident, typical batch |
| (64, 4096) | 256K | 1 MB | Bandwidth-bound regime |
Results
See Leaderboard filtered to Softmax for the full filterable view.
Causal Mask
Category: Masking | Complexity: O(S^2) elementwise | Memory: 1 pass (read+write)
Algorithm
Causal masking sets the upper triangle of the attention score matrix to negative infinity, preventing tokens from attending to future positions:
For i, j in [0..S) x [0..S):
if j > i: scores[i,j] = -inf
else: scores[i,j] = scores[i,j]
Applied between Q@K^T and softmax in autoregressive (decoder) attention:
scores = Q @ K^T / sqrt(d)
scores = causal_mask(scores) <-- this kernel
weights = softmax(scores)
This is memory-bandwidth bound (simple conditional copy), but critical for correctness in all decoder-only models (GPT, LLaMA, etc.).
ascend-rs Kernel Source
Causal mask using the tile API — safe entry form:
#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};
#[ascend_std::aiv_kernel]
pub fn tile_causal_mask(
input: GmView<'_, 64, 64, f32>,
output: GmViewMut<'_, 64, 64, f32>,
) {
let scores = tile_load_view_f32(&input);
let masked = safe::tile_causal_mask_f32(scores);
tile_store_view_f32(&output, masked);
}
}
The kernel body is pure safe Rust — shape (rows, cols, dtype) is committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. Square-shape enforcement (rows == cols) is also enforced at the type level. The #[aiv_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.
Backend status (lowered by rustc_codegen_mlir): Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V (4/9). Ascend AIV / CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO — on those backends causal masking is currently applied as a buffer-API element-wise compare-and-select rather than a single fused tile op.
Benchmark configurations
| Shape (S, S) | Elements | Bytes (f32) | Notes |
|---|---|---|---|
| (64, 64) | 4K | 16 KB | Small attention window |
| (128, 128) | 16K | 64 KB | Standard context |
| (256, 256) | 65K | 256 KB | Medium context |
| (512, 512) | 262K | 1 MB | Long context |
Results
See Leaderboard filtered to Causal Mask for the full filterable view.
RoPE (Rotary Position Embedding)
Category: Positional Encoding | Complexity: O(S*D) elementwise | Memory: 2 passes (read+write, plus cos/sin tables)
Algorithm
RoPE (Su et al. 2021) encodes position by rotating pairs of dimensions at frequency-dependent rates:
For each pair (x[2i], x[2i+1]):
theta = pos / 10000^(2i/d)
x'[2i] = x[2i]*cos(theta) - x[2i+1]*sin(theta)
x'[2i+1] = x[2i]*sin(theta) + x[2i+1]*cos(theta)
Used in every modern LLM (LLaMA, Mistral, GPT-NeoX, Qwen, etc.) to encode token position in Q/K vectors. RoPE is bandwidth-bound for short sequences and compute-bound (cos/sin) for long sequences.
ascend-rs Kernel Source
RoPE using the tile API — safe entry form:
#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};
#[ascend_std::aiv_kernel]
pub fn tile_rope(
input: GmView<'_, 1, 128, f32>,
output: GmViewMut<'_, 1, 128, f32>,
) {
let x = tile_load_view_f32(&input);
let y = safe::tile_rope_f32(x, 0); // base position = 0
tile_store_view_f32(&output, y);
}
}
The kernel body is pure safe Rust — shape (rows, cols, dtype) is committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. The #[aiv_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.
Backend status (lowered by rustc_codegen_mlir): Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V (4/9). Ascend AIV / CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO — on those backends RoPE is currently expressed as a buffer-API composition of element-wise cos/sin/mul/add rather than a single fused tile op.
Benchmark configurations
| Shape (B, S, D) | Elements | Bytes (f32) | Notes |
|---|---|---|---|
| (1, 64, 128) | 8K | 32 KB | Single query, short context |
| (32, 64, 128) | 262K | 1 MB | Batched queries |
| (1, 128, 128) | 16K | 64 KB | Longer head dim |
Results
See Leaderboard filtered to RoPE for the full filterable view.
LayerNorm
Category: Normalization | Complexity: O(N) per row | Memory: 3 passes
Algorithm
3-pass fused: mean, variance, normalize+affine in one workgroup:
- Mean: Parallel sum reduction, divide by N
- Variance: Parallel sum of (x - mean)^2, compute inverse std
- Affine:
gamma * (x - mean) * inv_std + beta
Uses SIMD group shuffles for warp-level reductions (1 threadgroup barrier instead of 8).
ascend-rs Kernel Source
LayerNorm in ascend-rs using vectorized AscendC intrinsics (f32, benchmarked implementation):
#![allow(unused)]
fn main() {
#[ascend_std::aiv_kernel]
pub fn layernorm(input: *const f32, output: *mut f32, len_buf: *const u32) {
let n = *len_buf;
let eps = 1.0e-5f32;
let in_buf = ascend_std::ascend_buf_alloc(n);
let out_buf = ascend_std::ascend_buf_alloc(n);
let work = ascend_std::ascend_buf_alloc(n);
let rwork = ascend_std::ascend_buf_alloc(n);
// DMA load: GM -> local buffer
ascend_std::ascend_buf_load_f32(in_buf, input, n);
ascend_std::ascend_pipe_barrier();
// Step 1: mean = sum(x) / n
let sum_val = ascend_std::ascend_reduce_sum_f32(work, in_buf, rwork, n);
let mean = sum_val / (n as f32);
// Step 2: centered = x - mean
ascend_std::ascend_adds_f32(out_buf, in_buf, -mean, n);
ascend_std::ascend_pipe_barrier();
// Step 3: var = sum((x - mean)^2) / n
ascend_std::ascend_mul_f32(work, out_buf, out_buf, n);
ascend_std::ascend_pipe_barrier();
let var_sum = ascend_std::ascend_reduce_sum_f32(work, work, rwork, n);
let inv_std = 1.0 / (var_sum / (n as f32) + eps).sqrt();
// Step 4: output = centered * inv_std
ascend_std::ascend_muls_f32(out_buf, out_buf, inv_std, n);
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_buf_store_f32(output, out_buf, n);
}
}
This buffer-API kernel is the primary implementation and runs on the Ascend AIV backend. A tile-API safe::tile_layernorm_f32 variant is additionally lowered by rustc_codegen_mlir to Apple Metal (1/9) — the other 8 backend lowerings (Ascend AIV / CUDA / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. On non-Metal backends, LayerNorm is currently composed at the buffer API as shown above (mean → sub → mul² → mean → sqrt → mul) rather than emitted as a single tile op.
Benchmark configurations
| Shape | Notes |
|---|---|
| (1, 768) | GPT-2 hidden dim, single position |
| (64, 768) | Typical batch |
| (1024, 768) | Large batch |
Results
See Leaderboard filtered to LayerNorm for the full filterable view.
RMS Norm
Category: Normalization | Complexity: O(N) per row | Memory: 2 passes
Algorithm
RMSNorm (Zhang & Sennrich 2019) is a simplified LayerNorm used in LLaMA, Gemma, and most modern LLMs. It omits the mean-centering step:
- RMS: Compute root-mean-square:
rms = sqrt(mean(x²) + ε) - Normalize + Scale:
y = (x / rms) * gamma
Compared to LayerNorm, RMSNorm saves one reduction pass (no mean computation) and one elementwise subtraction, yielding ~15% faster inference at equal accuracy.
ascend-rs Kernel Source
RMS Norm using ascend-rs buffer API (f32):
#![allow(unused)]
fn main() {
/// RMS Norm: y[i] = (x[i] / rms) * gamma[i]
/// where rms = sqrt(mean(x²) + eps)
///
/// params: [n: u32]
#[ascend_std::aiv_kernel]
pub fn rms_norm(
input: *const f32,
gamma: *const f32,
output: *mut f32,
params: *const u32,
) {
let n = *params;
let eps = 1.0e-5f32;
let in_buf = ascend_std::ascend_buf_alloc(n);
let gamma_buf = ascend_std::ascend_buf_alloc(n);
let work = ascend_std::ascend_buf_alloc(n);
let rwork = ascend_std::ascend_buf_alloc(n);
// Load input and gamma
ascend_std::ascend_buf_load_f32(in_buf, input, n);
ascend_std::ascend_buf_load_f32(gamma_buf, gamma, n);
ascend_std::ascend_pipe_barrier();
// Step 1: x² → work
ascend_std::ascend_mul_f32(work, in_buf, in_buf, n);
ascend_std::ascend_pipe_barrier();
// Step 2: rms = sqrt(mean(x²) + eps)
let sq_sum = ascend_std::ascend_reduce_sum_f32(work, work, rwork, n);
let inv_rms = 1.0 / (sq_sum / (n as f32) + eps).sqrt();
// Step 3: output = (x * inv_rms) * gamma
ascend_std::ascend_muls_f32(work, in_buf, inv_rms, n);
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_mul_f32(work, work, gamma_buf, n);
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_buf_store_f32(output, work, n);
}
}
This buffer-API kernel runs on the Ascend AIV backend. A tile-API safe::tile_rms_norm_f32 variant is additionally lowered by rustc_codegen_mlir to all 9 backends (Ascend AIV, CUDA, Apple Metal, Vulkan SPIR-V, AWS NKI, AMD AIE, Cambricon BANG, Intel Gaudi, Google TPU) — RMS Norm is one of the four “hot path” tile ops (alongside matmul, softmax, silu) that is lowered on every backend currently targeted.
Benchmark configurations
| Shape | Elements | Bytes (f32) | Notes |
|---|---|---|---|
| (1, 768) | 768 | 3 KB | GPT-2 hidden dim, single token |
| (1, 4096) | 4K | 16 KB | LLaMA-7B hidden dim |
| (64, 768) | 49K | 192 KB | Typical batch, GPT-2 |
| (64, 4096) | 262K | 1 MB | Typical batch, LLaMA |
| (1024, 4096) | 4.2M | 16 MB | Large batch, bandwidth-bound |
All benchmarks use f32.
Results
See Leaderboard filtered to RMS Norm for the full filterable view.
GELU
Category: Activation | Complexity: O(N) elementwise | Memory: 1 pass (fused read+write)
Algorithm
GELU (Gaussian Error Linear Unit, Hendrycks & Gimpel 2016) is the standard activation in BERT, GPT, LLaMA, and most transformer models:
GELU(x) = x · Φ(x) = x · 0.5 · (1 + erf(x / √2))
The fast tanh approximation (used in PyTorch gelu(approximate='tanh')):
GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))
GELU is memory-bandwidth bound — the compute-to-byte ratio is low (a few FLOPs per 4-byte element), so peak throughput is measured in GB/s rather than TFLOPS.
ascend-rs Kernel Source
GELU using ascend-rs buffer API (f32, tanh approximation):
#![allow(unused)]
fn main() {
/// GELU activation: y = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
///
/// params: [n: u32]
#[ascend_std::aiv_kernel]
pub fn gelu(
input: *const f32,
output: *mut f32,
params: *const u32,
) {
let n = *params as usize;
let sqrt_2_pi: f32 = 0.7978845608; // sqrt(2/pi)
let coeff: f32 = 0.044715;
let mut i: usize = 0;
while i < n {
let x = *input.wrapping_add(i);
let x3 = x * x * x;
let inner = sqrt_2_pi * (x + coeff * x3);
// tanh via exp: tanh(z) = (e^2z - 1)/(e^2z + 1)
let e2z = (2.0 * inner).exp();
let tanh_val = (e2z - 1.0) / (e2z + 1.0);
*output.wrapping_add(i) = 0.5 * x * (1.0 + tanh_val);
i += 1;
}
}
}
Vectorized version using buffer intrinsics:
#![allow(unused)]
fn main() {
#[ascend_std::aiv_kernel]
pub fn gelu_vec(
input: *const f32,
output: *mut f32,
params: *const u32,
) {
let n = *params;
let in_buf = ascend_std::ascend_buf_alloc(n);
let work = ascend_std::ascend_buf_alloc(n);
let work2 = ascend_std::ascend_buf_alloc(n);
ascend_std::ascend_buf_load_f32(in_buf, input, n);
ascend_std::ascend_pipe_barrier();
// x³
ascend_std::ascend_mul_f32(work, in_buf, in_buf, n);
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_mul_f32(work, work, in_buf, n);
ascend_std::ascend_pipe_barrier();
// 0.044715 * x³
ascend_std::ascend_muls_f32(work, work, 0.044715, n);
ascend_std::ascend_pipe_barrier();
// x + 0.044715 * x³
ascend_std::ascend_add_f32(work, in_buf, work, n);
ascend_std::ascend_pipe_barrier();
// sqrt(2/pi) * (x + 0.044715 * x³)
ascend_std::ascend_muls_f32(work, work, 0.7978845608, n);
ascend_std::ascend_pipe_barrier();
// Store result
ascend_std::ascend_buf_store_f32(output, work, n);
}
}
These buffer-API kernels run on the Ascend AIV backend via rustc_codegen_mlir. No tile-API safe::tile_gelu_f32 currently exists — tile-API lowerings on all 9 backends (Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. Cross-backend execution today goes through the buffer-API scalar loop or the element-wise intrinsic composition shown above.
Benchmark configurations
| Shape | Elements | Bytes (f32) | Notes |
|---|---|---|---|
| (1, 768) | 768 | 3 KB | GPT-2 hidden dim |
| (1, 4096) | 4K | 16 KB | LLaMA hidden dim |
| (64, 768) | 49K | 192 KB | Typical batch |
| (64, 4096) | 262K | 1 MB | Bandwidth-bound |
| (1024, 4096) | 4.2M | 16 MB | Large batch |
All benchmarks use f32.
Results
See Leaderboard filtered to GELU for the full filterable view.
SiLU / Swish
Category: Activation | Complexity: O(N) elementwise | Memory: 1 pass (fused read+write)
Algorithm
SiLU (Sigmoid Linear Unit), also known as Swish (Ramachandran et al. 2017), is the gate activation in LLaMA, Mistral, and most modern LLMs:
SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x))
Used in the FFN block:
FFN(x) = SiLU(W_gate * x) * (W_up * x)
Like GELU, SiLU is memory-bandwidth bound. The compute-to-byte ratio is low (a few FLOPs per element), so throughput is measured in GB/s.
ascend-rs Kernel Source
SiLU using the tile API — safe entry form (lowered by rustc_codegen_mlir to all 9 backends: Ascend AIV, CUDA, Apple Metal, Vulkan SPIR-V, AWS NKI, AMD AIE, Cambricon BANG, Intel Gaudi, Google TPU):
#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};
#[ascend_std::aiv_kernel]
pub fn tile_silu(
input: GmView<'_, 1, 4096, f32>,
output: GmViewMut<'_, 1, 4096, f32>,
) {
let x = tile_load_view_f32(&input);
let y = safe::tile_silu_f32(x);
tile_store_view_f32(&output, y);
}
}
The kernel body is pure safe Rust — shape (rows, cols, dtype) is committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. The #[aiv_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.
safe::tile_silu_f32 decomposes to: neg → exp → add_scalar(1) → reciprocal → mul with original x. SiLU is one of the four “hot path” tile ops (alongside matmul, softmax, rms-norm) that is lowered on every backend currently targeted.
Benchmark configurations
| Shape | Elements | Bytes (f32) | Notes |
|---|---|---|---|
| (1, 768) | 768 | 3 KB | GPT-2 hidden dim |
| (1, 4096) | 4K | 16 KB | LLaMA hidden dim |
| (64, 4096) | 262K | 1 MB | Typical batch |
| (1024, 4096) | 4.2M | 16 MB | Large batch |
Results
See Leaderboard filtered to SiLU for the full filterable view.
Embedding Lookup
Category: Memory Access | Complexity: O(N*D) gather | Memory: Random access (bandwidth-bound)
Algorithm
Embedding lookup gathers rows from a (V, D) weight table by token indices:
For each token index t[i] in [0..V):
output[i, :] = weight[t[i], :]
This is the first operation in any transformer: tokens (integers) become vectors. It is purely bandwidth-bound with random access patterns, making it a key memory subsystem benchmark.
ascend-rs Kernel Source
Embedding using the tile API — safe entry form with one unsafe block (the indices pointer is an integer gather source, not a tile, so safe::tile_embedding_f32 is declared pub unsafe fn):
#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};
#[ascend_std::aiv_kernel]
pub fn tile_embedding(
weight: GmView<'_, 32000, 128, f32>, // (V, D) codebook
indices: *const u32, // (N,) token ids — integer gather source
output: GmViewMut<'_, 32, 128, f32>, // (N, D) gathered rows
) {
let w = tile_load_view_f32(&weight);
// SAFETY: `indices` is a valid *const u32 of length COUNT=32, guaranteed by
// the launcher. The unsafe wrapper is the only non-safe surface.
let emb = unsafe { safe::tile_embedding_f32(w, indices) };
tile_store_view_f32(&output, emb);
}
}
Weight table and output shapes are committed at the type level via const generics (V, D, N), so any host-side mismatch becomes a compile-time error. The #[aiv_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 for the tile params so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.
Backend status (lowered by rustc_codegen_mlir): Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V. Ascend AIV / CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO.
Benchmark configurations
| Shape (N, V, D) | Output Elements | Bytes (f32) | Notes |
|---|---|---|---|
| (32, 32000, 128) | 4K | 16 KB | LLaMA-2 vocab, small dim |
| (128, 32000, 128) | 16K | 64 KB | Larger batch |
| (32, 32000, 4096) | 131K | 512 KB | Full hidden dim |
Results
See Leaderboard filtered to Embedding for the full filterable view.
Cross-Entropy Loss
Category: Loss Function | Complexity: O(N*V) reduction | Memory: 2 passes (max + sum-exp)
Algorithm
Cross-entropy loss is the standard training objective for classification and language modeling:
loss[i] = -logits[i, target[i]] + log(sum(exp(logits[i, :])))
Numerically stable version (log-sum-exp trick):
m = max(logits[i, :])
loss[i] = -(logits[i, target[i]] - m) + log(sum(exp(logits[i, :] - m)))
This kernel is compute-heavy for large vocabularies (V=32000+) due to the row-wise exp and reduction. It combines softmax-like reduction with an index gather.
ascend-rs Kernel Source
Cross-entropy using the tile API — safe entry form with one unsafe block (the targets pointer is an integer gather source, not a tile, so safe::tile_cross_entropy_f32 is declared pub unsafe fn):
#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};
#[ascend_std::aiv_kernel]
pub fn tile_cross_entropy(
logits: GmView<'_, 32, 32000, f32>, // (N, V)
targets: *const u32, // (N,) target class ids — integer gather
loss: GmViewMut<'_, 32, 1, f32>, // (N, 1) per-row loss
) {
let x = tile_load_view_f32(&logits);
// SAFETY: `targets` is a valid *const u32 of length R=32, guaranteed by
// the launcher. The unsafe wrapper is the only non-safe surface.
let y = unsafe { safe::tile_cross_entropy_f32(x, targets) };
tile_store_view_f32(&loss, y);
}
}
Logits and loss shapes (and their shared N) are committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. The #[aiv_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 for the tile params so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.
Backend status (lowered by rustc_codegen_mlir): Cambricon BANG, Intel Gaudi. Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Google TPU lowerings are TODO — this is the narrowest backend coverage of any kernel page, reflecting that cross-entropy is primarily a training-loss primitive.
Benchmark configurations
| Shape (N, V) | Elements | Bytes (f32) | Notes |
|---|---|---|---|
| (32, 32000) | 1M | 4 MB | LLaMA-2 vocab, small batch |
| (128, 32000) | 4M | 16 MB | Larger batch |
| (32, 50257) | 1.6M | 6.4 MB | GPT-2 vocab |
Results
See Leaderboard filtered to Cross-Entropy for the full filterable view.
Dilated Conv1D + ReLU
Category: Convolution | Complexity: O(B·T·C²·3) | Fusion: pad + gather + matmul + ReLU
Algorithm
A dilated 1D convolution with kernel size 3, fused with bias and ReLU activation. Used in VQ-VAE encoder/decoder ResConv1DBlocks (e.g., SOKE, Jukebox-style models).
The naive implementation requires:
- Pad the input by
dilationon each side (zero-padding) - Gather 3 positions per output:
[t-d, t, t+d] - Concat along the channel axis ->
(B, T, 3C) - Linear projection
(3C -> C) - ReLU
This kernel fuses all 5 steps into a single GPU pass, eliminating three intermediate (B, T, 3C) buffer allocations and the data shuffles between them.
Why fusion matters
For an 18-block VQ-VAE encoder/decoder, the unfused version allocates 54 intermediate tensors per forward pass and reads them back. Fusing into one kernel:
- Eliminates intermediate buffer writes/reads (3x memory bandwidth reduction)
- Keeps activations in registers/L1 cache between stages
- One command buffer dispatch instead of five
ascend-rs Kernel Source
Vectorized dilated conv1d + ReLU using ascend-rs buffer API (f32, benchmarked implementation):
#![allow(unused)]
fn main() {
#[ascend_std::aiv_kernel]
pub fn conv1d_dilated(input: *const f32, output: *mut f32, params: *const u32) {
let n = *params;
let dilation = *params.wrapping_add(1);
let w0 = f32::from_bits(*params.wrapping_add(2));
let w1 = f32::from_bits(*params.wrapping_add(3));
let w2 = f32::from_bits(*params.wrapping_add(4));
let bias = f32::from_bits(*params.wrapping_add(5));
let aligned_n = ((n + 7) / 8) * 8;
let in_buf = ascend_std::ascend_buf_alloc(aligned_n);
let tap_left = ascend_std::ascend_buf_alloc(aligned_n);
let tap_right = ascend_std::ascend_buf_alloc(aligned_n);
let acc = ascend_std::ascend_buf_alloc(aligned_n);
let work = ascend_std::ascend_buf_alloc(aligned_n);
ascend_std::ascend_buf_load_f32(in_buf, input, n);
ascend_std::ascend_pipe_barrier();
// Build shifted tap buffers with zero-padding
ascend_std::ascend_buf_fill_f32(tap_left, 0.0, aligned_n);
let mut i = dilation;
while i < n {
let v = ascend_std::ascend_get_value_f32(in_buf, i - dilation);
ascend_std::ascend_set_value_f32(tap_left, i, v);
i += 1;
}
ascend_std::ascend_buf_fill_f32(tap_right, 0.0, aligned_n);
i = 0;
while i + dilation < n {
let v = ascend_std::ascend_get_value_f32(in_buf, i + dilation);
ascend_std::ascend_set_value_f32(tap_right, i, v);
i += 1;
}
// Vector MAC: acc = tap_left*w0 + input*w1 + tap_right*w2 + bias
ascend_std::ascend_muls_f32(acc, tap_left, w0, n);
ascend_std::ascend_muls_f32(work, in_buf, w1, n);
ascend_std::ascend_add_f32(tap_left, acc, work, n);
ascend_std::ascend_muls_f32(work, tap_right, w2, n);
ascend_std::ascend_add_f32(acc, tap_left, work, n);
ascend_std::ascend_adds_f32(acc, acc, bias, n);
ascend_std::ascend_maxs_f32(acc, acc, 0.0, n); // ReLU
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_buf_store_f32(output, acc, n);
}
}
This buffer-API kernel runs on the Ascend AIV backend via rustc_codegen_mlir. No tile-API safe::tile_conv1d_f32 currently exists — tile-API lowerings on all 9 backends (Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. On non-Ascend backends the fused pad+gather+matmul+ReLU is currently expressed as a buffer-API composition rather than a single tile op.
Benchmark configurations
| Shape (B, T, C) | Elements | Notes |
|---|---|---|
| (2, 50, 512) | 51 K | Single VQ-VAE block, small batch |
| (8, 100, 512) | 410 K | Mid-sized clip |
| (2, 400, 512) | 410 K | Long sequence |
Results
See Leaderboard filtered to conv1d-dilated for the full filterable view.
VQ Quantize + EMA Update
Category: Quantization | Complexity: O(N·K·D) | Fusion: L2 distance + argmin + scatter-add
Algorithm
Vector quantization maps each input vector to its nearest codebook entry, then updates the codebook via exponential moving average (EMA). Used in VQ-VAE training (SOKE, Jukebox, SoundStream).
Pipeline:
- L2 distance: For each input vector x[i] (dim D), compute
||x[i] - c[k]||²against all K codebook entries - Argmin: Find nearest codebook entry
k* = argmin_k ||x[i] - c[k]||² - Quantize: Output
q[i] = c[k*](the nearest codebook vector) - EMA scatter-add: Accumulate
x[i]into codebook slotk*for EMA update:sum[k*] += x[i],count[k*] += 1
Fusing all 4 steps into one kernel eliminates 3 intermediate buffers (distance matrix, index array, scatter workspace).
ascend-rs Kernel Source
VQ quantize kernel using ascend-rs buffer API (f32):
#![allow(unused)]
fn main() {
/// VQ Quantize: for each input vector, find nearest codebook entry (L2),
/// output the quantized vector, and scatter-add for EMA codebook update.
///
/// params: [n_vectors: u32, n_codes: u32, dim: u32]
#[ascend_std::aiv_kernel]
pub fn vq_quantize(
input: *const f32, // (N, D) input vectors
codebook: *const f32, // (K, D) codebook
output: *mut f32, // (N, D) quantized output
cb_sum: *mut f32, // (K, D) EMA numerator accumulator
cb_count: *mut u32, // (K,) EMA denominator counter
params: *const u32,
) {
let n = *params; // number of input vectors
let k = *params.wrapping_add(1); // codebook size
let d = *params.wrapping_add(2); // vector dimension
let buf_x = ascend_std::ascend_buf_alloc(d); // current input vector
let buf_c = ascend_std::ascend_buf_alloc(d); // current codebook entry
let buf_diff = ascend_std::ascend_buf_alloc(d); // x - c
let buf_work = ascend_std::ascend_buf_alloc(d);
let buf_rwork = ascend_std::ascend_buf_alloc(d);
let mut i: u32 = 0;
while i < n {
// Load input vector x[i]
let x_ptr = input.wrapping_add((i * d) as usize);
ascend_std::ascend_buf_load_f32(buf_x, x_ptr, d);
ascend_std::ascend_pipe_barrier();
// Find nearest codebook entry (L2 argmin)
let mut best_k: u32 = 0;
let mut best_dist: f32 = f32::MAX;
let mut j: u32 = 0;
while j < k {
let c_ptr = codebook.wrapping_add((j * d) as usize);
ascend_std::ascend_buf_load_f32(buf_c, c_ptr, d);
ascend_std::ascend_pipe_barrier();
// diff = x - c
ascend_std::ascend_sub_f32(buf_diff, buf_x, buf_c, d);
ascend_std::ascend_pipe_barrier();
// diff² = diff * diff
ascend_std::ascend_mul_f32(buf_diff, buf_diff, buf_diff, d);
ascend_std::ascend_pipe_barrier();
// dist = sum(diff²)
let dist = ascend_std::ascend_reduce_sum_f32(
buf_work, buf_diff, buf_rwork, d);
if dist < best_dist {
best_dist = dist;
best_k = j;
}
j += 1;
}
// Output: quantized = codebook[best_k]
let best_ptr = codebook.wrapping_add((best_k * d) as usize);
ascend_std::ascend_buf_load_f32(buf_c, best_ptr, d);
ascend_std::ascend_pipe_barrier();
let out_ptr = output.wrapping_add((i * d) as usize);
ascend_std::ascend_buf_store_f32(out_ptr, buf_c, d);
// EMA scatter-add: cb_sum[best_k] += x[i], cb_count[best_k] += 1
let sum_ptr = cb_sum.wrapping_add((best_k * d) as usize);
let sum_buf = ascend_std::ascend_buf_alloc(d);
ascend_std::ascend_buf_load_f32(sum_buf, sum_ptr, d);
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_add_f32(sum_buf, sum_buf, buf_x, d);
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_buf_store_f32(sum_ptr, sum_buf, d);
let count_val = *cb_count.wrapping_add(best_k as usize);
*cb_count.wrapping_add(best_k as usize) = count_val + 1;
i += 1;
}
}
}
This buffer-API kernel runs on the Ascend AIV backend via rustc_codegen_mlir, and avoids materializing the N×K distance matrix and K-element index array. No tile-API safe::tile_vq_quantize_f32 currently exists — tile-API lowerings on all 9 backends (Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. Cross-backend VQ today uses vendor kernels (aclnnMatmul, MPS GEMM, torch.cdist) with a separate argmin pass rather than the fused Rust kernel shown above.
Benchmark configurations
| Shape (N, K, D) | FLOPs | Notes |
|---|---|---|
| (256, 512, 64) | 16.8 M | Small codebook, low-latency inference |
| (1024, 512, 64) | 67.1 M | Typical VQ-VAE batch |
| (1024, 1024, 128) | 268 M | Large codebook, high-dim embeddings |
| (4096, 512, 64) | 268 M | Large batch training |
All benchmarks use f32.
Results
| Device | Shape (N, K, D) | Latency (μs) | GFLOPS | Notes |
|---|---|---|---|---|
| Ascend 910B | (4096, 1024, 128) | 94 | 11,411 | aclnnMatmul L2 trick |
| Ascend 910B | (1024, 1024, 128) | 31 | 8,604 | Large codebook |
| Ascend 910B | (4096, 512, 64) | 43 | 6,243 | Large batch |
| Apple M2 Max | (4096, 1024, 128) | 646 | 1,662 | MPS GEMM + CPU argmin |
| Apple M2 Max | (8192, 512, 64) | 450 | 1,193 | Large batch |
| Tesla T4 | (4096, 1024, 128) | 1,163 | 923 | torch.cdist + argmin |
| Tesla T4 | (1024, 1024, 128) | 494 | 544 | torch.cdist + argmin |
| Tesla T4 | (4096, 512, 64) | 624 | 430 | torch.cdist + argmin |
Peak: 11.4 TFLOPS on Ascend 910B (cube engine via L2 distance matmul trick). Apple M2 Max peaks at 1.7 TFLOPS via MPS. Tesla T4 peaks at 923 GFLOPS via torch.cdist.
See Leaderboard filtered to VQ Quantize for the full filterable view.
Financial Sidecar
Real-time context for xPU investment and procurement decisions.
Stock prices (AI chip vendors)
| Ticker | Company | Role |
|---|---|---|
| NVDA | NVIDIA | GPU market leader |
| AMD | AMD | MI300X, CDNA competitor |
| AAPL | Apple | M-series, Metal ecosystem |
| INTC | Intel | Gaudi, Habana |
| GOOG | TPU, custom silicon | |
| AMZN | Amazon | Trainium, Inferentia |
Device street prices
Tracking real-world prices (not MSRP) helps compute true cost-effectiveness:
| Device | MSRP | Street Price | Source |
|---|---|---|---|
| NVIDIA H100 SXM | $30,000 | Check latest | eBay, broker |
| NVIDIA A100 80GB | $10,000 | Check latest | eBay, broker |
| AMD MI300X | $15,000 | Check latest | AMD direct |
| Apple M4 Max (laptop) | $3,999 | Check latest | Apple Store |
Commodity reference
| Symbol | Relevance |
|---|---|
| Gold (XAU) | Store-of-value benchmark |
| Oil (WTI) | Energy cost proxy |
| BTC | Crypto mining demand affects GPU pricing |
| USD/CNY | Huawei/Cambricon pricing |
Price data updated weekly via scripts/fetch_prices.py.
Submit Results
CSV format
Create a CSV file named <device-slug>.csv with these columns:
device_id,kernel_id,dtype,input_shape,batch_size,impl_lang,latency_us,driver_version,toolchain,git_sha,submitter
nvidia-h100-sxm,softmax,f32,"[64, 1024]",1,cuda,12.3,CUDA 12.4,nvcc 12.4,abc1234,your-name
Steps
- Fork the pu-rs.org repo
- Add your CSV to
submissions/ - Open a pull request
- CI validates format and sanity checks
- Maintainers review and merge
Requirements
- Minimum 20 runs per (kernel, shape) pair
- Report median latency
- Include driver version and toolchain
- Device must exist in
db/seed_devices.sql(or add it in the same PR)
Running the benchmark
All benchmark scripts live in this repo under scripts/.
# Metal (Apple Silicon)
# Requires: ascend_metal_kernels Python module
# (build: cd ascend-rs/crates/ascend_metal_py && maturin develop --release)
ASCEND_METAL_KERNELS=1 python3 scripts/bench_metal.py --device apple-m2-max-38
ASCEND_METAL_KERNELS=1 python3 scripts/bench_metal.py --device apple-m4-max-40 -o submissions/m4-max.csv
# Ascend NPU (Huawei 910B/910C)
# Requires: CANN SDK + ascend-rs repo cloned locally
bash scripts/bench_ascend.sh --device huawei-910b
bash scripts/bench_ascend.sh --device huawei-910c --only softmax --ascend-rs ~/ascend-rs
Supported backends
| Backend | Script | Prerequisites |
|---|---|---|
| Apple Metal | scripts/bench_metal.py | ascend_metal_kernels Python module (build instructions) |
| Huawei Ascend | scripts/bench_ascend.sh | CANN SDK + ascend-rs repo |
| NVIDIA CUDA | scripts/bench_cuda.py | Planned |
| AMD ROCm | scripts/bench_rocm.py | Planned |