Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

pu-rs.org – Processing Unit Ranking System

The SPECfp for AI accelerators.

FLOPS don’t tell the full story. A chip rated at 1000 TFLOPS means nothing if your softmax kernel only achieves 5% utilization. pu-rs.org measures what matters: actual kernel execution time on real hardware, for the operations that AI workloads actually run.

Why this exists

What we measureWhat others report
Softmax latency at (64, 4096) f16Peak TFLOPS
LayerNorm throughput per wattMemory bandwidth (theoretical)
MatMul efficiency vs rooflineMarketing benchmarks
Cost per real GOPSCloud $/hour (opaque)

Scope

We benchmark the kernel primitives that compose every AI model:

CategoryKernels
ActivationSoftmax, GELU, SiLU
NormalizationLayerNorm, RMSNorm
Linear AlgebraGEMM, batched MatMul
AttentionScaled Dot-Product Attention
QuantizationVQ-Quantize, INT8 dequant
ConvolutionConv1D, dilated Conv1D
ReductionScatter-add, L1-smooth loss

Devices covered

TypeVendors
GPUNVIDIA (A100, H100, H200, B200), AMD (MI300X), Apple (M2/M4 Max), Cambricon
TPUGoogle (v5e, v6e Trillium)
NPUHuawei Ascend (910B, 910C), AWS Trainium2, Intel Gaudi 3

How it works

  1. Run standardized benchmark scripts on your hardware
  2. Submit CSV results via pull request
  3. CI validates format and sanity checks
  4. Leaderboard updates automatically with per-kernel rankings

All results tagged with git SHA, driver version, toolchain, and number of runs. Median latency reported. Full methodology.

End-to-end complement

Per-kernel latency is only half the story — a chip can win on softmax and still lose on a real model. The DeepSeek decode page reports end-to-end throughput across five accelerators (Ascend 910B2, TPU v2-8, Apple M2 Max, NVIDIA T4, AWS Trainium1) from the same 13-kernel Rust source emitted through the ascend-rs MLIR backends.


Built with ascend-rs kernel infrastructure. Data updated weekly.

Leaderboard

# Vendor Device Type Kernel Dtype Shape Latency (us) GOPS GOPS/$ GOPS/W Verified

DeepSeek Decode Throughput (Cross-Vendor)

Per-kernel latency (the rest of this site) is the microbenchmark view. This page reports the complementary end-to-end decode throughput for a full DeepSeek-R1-Distill-Qwen-1.5B decode step, measured on five production accelerators from the same 13-kernel Rust source emitted through ascend-rs’s per-vendor MLIR backends.

Headline

RankDeviceBackendEmitted languageDecode tok/s
1Huawei Ascend 910B2mlir_to_cpp+mlir_to_ptoAscendC C++ + PTO-MLIR168.9
2Google TPU v2-8 (Colab)mlir_to_tpuPallas162.9
3Apple M2 Maxmlir_to_mslMetal91.7
4NVIDIA T4 (Colab)mlir_to_gpuCUDA53.7
5AWS Trainium1 (trn1.2xlarge)mlir_to_nkiNKI Python12.2
CPU referenceplain Rust3.7

All five numbers come from identical Rust kernel sources. The MLIR → vendor backend is the only thing that changes between rows. 168.9 tok/s on 910B2 is 2.47× the aclnn-only baseline and 45.6× the CPU reference.

Why this complements the kernel leaderboard

The kernel leaderboard tells you which chip runs a given softmax or GEMM fastest in isolation. Decode tok/s tells you what a real inference workload actually achieves once those kernels are composed with host-side launch overhead, KV cache traffic, and HBM pressure. A chip can win per-kernel and still lose on decode (the Trainium row is the clearest example — strong per-op latency, 9.5% bandwidth utilisation end-to-end).

Notes per device

  • Ascend 910B2 — the +pto half of the joint path contributes the four decode matmul shapes (1.75×–2.98× vs aclnn). RMSNorm stays on CPU; every other op is on-NPU. See the main ascend-rs blog ch10 for the per-kernel breakdown.
  • TPU v2-8 — Colab-visible Pallas on a 4-chip v2 pod; the emitted kernel set is rms_norm, matvec_f16, and an attention fusion.
  • Apple M2 Max — emitted Metal beats Apple’s hand-tuned MLX on decode at this model size. Measured via deepseek_metal.
  • NVIDIA T4 — Colab Tesla T4; same three kernels as the TPU row, emitted in CUDA. 53.7 tok/s is below-roofline (T4 is HBM-bandwidth-starved for this model shape).
  • AWS Trainium1trn1.2xlarge. Six emitted NKI kernels (rms_norm, three matvec_f16 variants, gate_up_silu). Traced via torch_neuronx.trace in two halves (eager single-NEFF runs at 2.5 tok/s — 5× slower — because the single compile unit can’t pipeline across the whole decode path). Trace time: 461 s; wall time: 5.23 s for 64 decode steps.

Reproduction

Each number is reproducible with the commands documented in the per-device sections of ascend-rs ch10. The source kernels live at crates/ascend_std/src/tile.rs in the public repo; the per-vendor emitters live in crates/rustc_codegen_mlir/src/mlir_to_*.rs.


The per-kernel leaderboard remains the authoritative view for isolated-kernel efficiency. Decode throughput is reported here as the complementary end-to-end metric.

Cost Effectiveness

The most important metric for deployment decisions: how much real performance do you get per dollar and per watt?

# Device Kernel Latency (us) MSRP ($) TDP (W) GOPS/$ GOPS/W

Methodology

Measurement protocol

  1. Warmup: 50 iterations discarded
  2. Measurement: 500 iterations, median latency reported
  3. Amortization: Dispatch overhead amortized by batching 500 kernel launches into one command buffer where supported
  4. Isolation: Benchmarks run on idle systems, no background GPU workloads

What we measure

Kernel-only time: the GPU/NPU execution time for a single kernel dispatch, excluding:

  • Host-to-device data transfer (data assumed resident)
  • Command buffer creation overhead (amortized)
  • Python/framework overhead

This isolates the hardware+compiler efficiency from the software stack.

Reporting

MetricDefinition
Latency (us)Median kernel execution time in microseconds
GOPSThroughput: operations / latency
GOPS/$Throughput / device MSRP in USD
GOPS/WThroughput / TDP in watts

Standardized configurations

Each kernel is benchmarked at these canonical shapes:

KernelShapesDtypes
Softmax(1,1024), (64,1024), (64,4096)f32, f16
LayerNorm(1,768), (64,768), (1024,768)f32, f16
GEMM(1024,1024,1024), (4096,4096,4096)f32, f16, bf16
Attention(1,32,128,128), (32,32,2048,128)f32, f16

How to submit

See Submit Results.

GEMM (MatMul)

Category: Linear Algebra | Complexity: O(M·K·N) | Compute: Cube/Tensor Core bound

Algorithm

Dense matrix multiplication: C[M×N] = A[M×K] × B[K×N].

The fundamental ML primitive — dominates runtime in transformers (linear projections, attention scores, FFN layers). Performance depends on tiling strategy, memory hierarchy utilization, and hardware matrix units (cube engines, tensor cores).

ascend-rs Kernel Source

Matrix multiplication in ascend-rs uses the tile API, which compiles to hardware-specific matmul units (cube engine on Ascend, tensor cores on CUDA, etc.).

Safe entry form — kernel body is pure safe Rust, shape committed at the type level:

#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[ascend_std::aiv_kernel]
pub fn tile_matmul(
    a:      GmView<'_, M, K, f32>,
    b:      GmView<'_, K, N, f32>,
    output: GmViewMut<'_, M, N, f32>,
) {
    let at = tile_load_view_f32(&a);
    let bt = tile_load_view_f32(&b);
    let c  = safe::tile_matmul_f32(at, bt);
    tile_store_view_f32(&output, c);
}
}

No unsafe blocks in the body. A mismatched K between the two operands is a compile-time error. The #[aiv_kernel] attribute rewrites the emitted signature to raw *const f32 / *mut f32 so the launcher toolchain sees an unchanged C ABI — #[repr(transparent)] makes the rewrite free at the LLVM IR level.

This compiles via rustc_codegen_mlir → MLIR → target code on all 9 backends:

  • Ascend AIV: PTO-MLIR pto.tmatmul → cube engine (320 TFLOPS f16 on 910B)
  • CUDA: __shared__ tiled GEMM with __syncthreads()
  • Apple Metal / Vulkan SPIR-V: compute shader with shared-memory tiling
  • AWS NKI (Trainium): nki.isa.nc_matmul
  • AMD AIE: AIE2P cascade matmul
  • Cambricon BANG: __bang_matmul on MLU tensor units
  • Intel Gaudi: HPU matmul intrinsic
  • Google TPU: XLA dot_general via OpenXLA

For benchmarking, vendor-optimized libraries are used: aclnnMatmul (Ascend), cuBLAS (CUDA), MPSMatrixMultiplication (Metal).

Benchmark configurations

Shape (A × B)FLOPsNotes
[1024, 1024] × [1024, 1024]2.1 GSmall square, tests dispatch overhead
[4096, 4096] × [4096, 4096]137 GStandard benchmark, bandwidth→compute transition
[8192, 8192] × [8192, 8192]1.1 TLarge square, saturates compute units
[16384, 16384] × [16384, 16384]8.8 TFull hardware saturation
[1024, 4096] × [4096, 1024]8.6 GRectangular, typical FFN down-projection
[4096, 1024] × [1024, 4096]34.4 GRectangular, typical FFN up-projection
[2048, 8192] × [8192, 2048]67.1 GTransformer-scale attention projection

All benchmarks use f16 input with f16 output (or f32 accumulation where supported).

Results

DeviceShapeLatency (μs)TFLOPSGOPS/W
Ascend 910B[4096²]×[4096²]437314.51014
Ascend 910B[8192²]×[8192²]3,614304.2981
Ascend 910B[16384²]×[16384²]27,467320.21033
Ascend 910B[2048, 8192]×[8192, 2048]245280.0903
Ascend 910B[4096, 1024]×[1024, 4096]132260.1839
Apple M2 Max[4096²]×[4096²]17,3747.953
Apple M2 Max[8192²]×[8192²]139,5967.953
Apple M2 Max[2048, 8192]×[8192, 2048]8,9727.751
Apple M2 Max[4096, 1024]×[1024, 4096]4,3457.953
Tesla T4[4096²]×[4096²]5,69824.1345
Tesla T4[8192²]×[8192²]44,09924.9356
Tesla T4[2048, 8192]×[8192, 2048]2,56726.8383
Tesla T4[4096, 1024]×[1024, 4096]1,54922.2317

Peak: 320 TFLOPS (f16) on Ascend 910B — saturating the theoretical maximum. Tesla T4 peaks at 26.8 TFLOPS (f16) via cuBLAS (torch.matmul). Apple M2 Max peaks at 7.9 TFLOPS (f16) via MPSMatrixMultiplication.

See Leaderboard filtered to MatMul for the full filterable view.

Attention (Scaled Dot-Product)

Category: Attention | Complexity: O(B·H·S²·D) | Compute: Cube/Tensor Core bound

Algorithm

Scaled dot-product attention: Output = softmax(Q·K^T / √d) · V

The core transformer primitive — computes attention weights from queries and keys, applies softmax normalization, then produces a weighted sum of values. Dominates runtime in all transformer architectures (GPT, BERT, LLaMA, etc.).

Pipeline:

  1. Scores = Q × K^T — matmul (S×D) × (D×S) → (S×S)
  2. Scale by 1/√d — element-wise multiply
  3. Softmax along last axis — numerically stable (max → sub → exp → sum → div)
  4. Output = Weights × V — matmul (S×S) × (S×D) → (S×D)

ascend-rs Kernel Source

The attention pipeline in ascend-rs combines tile-API matmul with custom Rust kernels for scale and softmax:

#![allow(unused)]
fn main() {
use ascend_rs::prelude::*;

let scale = 1.0f32 / (d_k as f32).sqrt();

// Step 1: scores = Q × K^T  (HGEMM via cube engine)
acl_blas_hgemm(TransN, TransT, TransN,
    seq_len, seq_len, d_k,
    &alpha, &d_q, d_k, &d_k_mat, d_k,
    &beta, &mut d_scores, seq_len,
    HighPrecision, &stream)?;

// Step 2: scores *= 1/√d_k  (custom Rust kernel → NPU)
scale_kernel.launch(1, &stream, &mut [
    d_scores.as_mut_ptr(),  // in-place
    d_scores.as_mut_ptr(),  // output (same buffer)
    d_n_scores.as_mut_ptr(),
    d_scale.as_mut_ptr(),
])?;

// Step 3: weights = softmax(scores)  (custom Rust kernel → NPU)
softmax_kernel.launch(1, &stream, &mut [
    d_scores.as_mut_ptr(),
    d_weights.as_mut_ptr(),
    d_row_len.as_mut_ptr(),
    d_num_rows.as_mut_ptr(),
])?;

// Step 4: output = weights × V  (HGEMM via cube engine)
acl_blas_hgemm(TransN, TransN, TransN,
    seq_len, d_k, seq_len,
    &alpha, &d_weights, seq_len, &d_v, d_k,
    &beta, &mut d_output, d_k,
    HighPrecision, &stream)?;
}

The scale and softmax kernels are written in Rust and compiled via rustc_codegen_mlir → MLIR → backend code. The GEMMs use vendor-optimized libraries (aclnnMatmul, cuBLAS, MPSMatrixMultiplication).

Backend status for the fused safe::tile_attention_f32 tile op: Ascend AIV, Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V (5/9). CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO — on those backends the pipeline still runs as separate matmul + softmax + matmul dispatches using the individually-lowered tile ops.

Benchmark configurations

Shape (B, H, S, D)FLOPsNotes
(1, 1, 128, 64)4.2 MSmall baseline, dispatch overhead test
(1, 1, 512, 64)67 MMedium sequence
(1, 1, 1024, 64)268 MGPT-2 scale
(1, 1, 2048, 64)1.1 GLong context
(1, 1, 4096, 64)4.3 GVery long context (quadratic scaling)
(1, 8, 512, 64)537 M8-head, GPT-2 like
(1, 12, 512, 64)805 M12-head, BERT-base
(1, 32, 512, 64)2.1 G32-head, LLaMA-7B
(1, 32, 1024, 128)17.2 G32-head, LLaMA-2-7B
(1, 32, 2048, 128)68.7 G32-head, long context

All benchmarks use f16 input with f16 output. FLOPs ≈ 4·B·H·S²·D (two matmuls dominate).

Results

DeviceShape (B,H,S,D)Latency (μs)TFLOPSNotes
Ascend 910B(1,32,1024,128)31055.4aclnnMatmul+Softmax, manual pipeline
Ascend 910B(1,32,2048,128)1,45947.1Memory-bound at long context
Ascend 910B(1,1,4096,64)14928.9Single-head, large S
Ascend 910B(1,32,512,64)10520.432-head, short context
Tesla T4(1,32,1024,128)2,6096.6F.scaled_dot_product_attention
Tesla T4(1,32,2048,128)5,06713.6Flash attention backend
Tesla T4(1,1,4096,64)9744.4Single-head
Tesla T4(1,32,512,64)4275.032-head
Apple M2 Max(1,32,1024,128)138,8190.12MPS GEMM + CPU softmax
Apple M2 Max(1,1,4096,64)60,6470.07MPS GEMM + CPU softmax

Peak: 55.4 TFLOPS (f16) on Ascend 910B. Tesla T4 peaks at 13.6 TFLOPS (f16) via PyTorch SDPA. Apple M2 Max peaks at 0.14 TFLOPS — bottlenecked by CPU softmax (no fused MPS attention).

See Leaderboard filtered to Attention for the full filterable view.

Softmax

Category: Activation | Complexity: O(N) per row | Memory: 2 passes over input

Algorithm

The online 2-pass softmax (Milakov & Gimelshein 2018):

Pass 1 (single traversal): Maintain running (max, sum) pair per thread. When a new maximum is found, rescale the accumulated sum:

sum_new = sum_old * exp(max_old - max_new) + exp(x - max_new)

Pass 2: Write exp(x - global_max) / global_sum per element.

This is 33% less memory traffic than the naive 3-pass algorithm (max, exp+sum, normalize).

ascend-rs Kernel Source

Softmax in ascend-rs uses the buffer API for element-wise backends and the tile API for matrix-oriented backends:

Scalar kernel (f32, benchmarked implementation):

#![allow(unused)]
fn main() {
#[ascend_std::aiv_kernel]
pub fn softmax(input: *const f32, output: *mut f32, len: *const u32) {
    let n = *len as usize;

    // Step 1: Find max for numerical stability
    let mut max_val = *input;
    let mut i = 1usize;
    loop {
        if i >= n { break; }
        let val = *input.wrapping_add(i);
        if val > max_val { max_val = val; }
        i += 1;
    }

    // Step 2: exp(x - max) and accumulate sum
    let mut sum: f32 = 0.0;
    i = 0;
    loop {
        if i >= n { break; }
        let exp_val = (*input.wrapping_add(i) - max_val).exp();
        *output.wrapping_add(i) = exp_val;
        sum += exp_val;
        i += 1;
    }

    // Step 3: Normalize
    i = 0;
    loop {
        if i >= n { break; }
        *output.wrapping_add(i) = *output.wrapping_add(i) / sum;
        i += 1;
    }
}
}

Tile API — safe entry form (lowered by rustc_codegen_mlir to all 9 backends: Ascend AIV, CUDA, Apple Metal, Vulkan SPIR-V, AWS NKI, AMD AIE, Cambricon BANG, Intel Gaudi, Google TPU):

#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[ascend_std::aiv_kernel]
pub fn tile_softmax(
    input:  GmView<'_, 1, 1024, f32>,
    output: GmViewMut<'_, 1, 1024, f32>,
) {
    let x = tile_load_view_f32(&input);
    let y = safe::tile_softmax_f32(x);
    tile_store_view_f32(&output, y);
}
}

The kernel body is pure safe Rust — no unsafe blocks. Shape (rows, cols, dtype) is committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. The #[aiv_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 so the launcher/compiler toolchain (bisheng / ACL / nvcc) sees the same C ABI. #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level — the two forms emit as literal symbol aliases.

Kernels compile via rustc_codegen_mlir → MLIR → target-specific code. Softmax is one of the four “hot path” tile ops (alongside matmul, rms-norm, silu) that is lowered on every backend currently targeted.

Benchmark configurations

ShapeElementsBytes (f32)Notes
(1, 1024)1K4 KBL1-resident, tests dispatch overhead
(64, 1024)64K256 KBL2-resident, typical batch
(64, 4096)256K1 MBBandwidth-bound regime

Results

See Leaderboard filtered to Softmax for the full filterable view.

Causal Mask

Category: Masking | Complexity: O(S^2) elementwise | Memory: 1 pass (read+write)

Algorithm

Causal masking sets the upper triangle of the attention score matrix to negative infinity, preventing tokens from attending to future positions:

For i, j in [0..S) x [0..S):
  if j > i:  scores[i,j] = -inf
  else:      scores[i,j] = scores[i,j]

Applied between Q@K^T and softmax in autoregressive (decoder) attention:

scores = Q @ K^T / sqrt(d)
scores = causal_mask(scores)   <-- this kernel
weights = softmax(scores)

This is memory-bandwidth bound (simple conditional copy), but critical for correctness in all decoder-only models (GPT, LLaMA, etc.).

ascend-rs Kernel Source

Causal mask using the tile API — safe entry form:

#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[ascend_std::aiv_kernel]
pub fn tile_causal_mask(
    input:  GmView<'_, 64, 64, f32>,
    output: GmViewMut<'_, 64, 64, f32>,
) {
    let scores = tile_load_view_f32(&input);
    let masked = safe::tile_causal_mask_f32(scores);
    tile_store_view_f32(&output, masked);
}
}

The kernel body is pure safe Rust — shape (rows, cols, dtype) is committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. Square-shape enforcement (rows == cols) is also enforced at the type level. The #[aiv_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.

Backend status (lowered by rustc_codegen_mlir): Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V (4/9). Ascend AIV / CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO — on those backends causal masking is currently applied as a buffer-API element-wise compare-and-select rather than a single fused tile op.

Benchmark configurations

Shape (S, S)ElementsBytes (f32)Notes
(64, 64)4K16 KBSmall attention window
(128, 128)16K64 KBStandard context
(256, 256)65K256 KBMedium context
(512, 512)262K1 MBLong context

Results

See Leaderboard filtered to Causal Mask for the full filterable view.

RoPE (Rotary Position Embedding)

Category: Positional Encoding | Complexity: O(S*D) elementwise | Memory: 2 passes (read+write, plus cos/sin tables)

Algorithm

RoPE (Su et al. 2021) encodes position by rotating pairs of dimensions at frequency-dependent rates:

For each pair (x[2i], x[2i+1]):
  theta = pos / 10000^(2i/d)
  x'[2i]   = x[2i]*cos(theta) - x[2i+1]*sin(theta)
  x'[2i+1] = x[2i]*sin(theta) + x[2i+1]*cos(theta)

Used in every modern LLM (LLaMA, Mistral, GPT-NeoX, Qwen, etc.) to encode token position in Q/K vectors. RoPE is bandwidth-bound for short sequences and compute-bound (cos/sin) for long sequences.

ascend-rs Kernel Source

RoPE using the tile API — safe entry form:

#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[ascend_std::aiv_kernel]
pub fn tile_rope(
    input:  GmView<'_, 1, 128, f32>,
    output: GmViewMut<'_, 1, 128, f32>,
) {
    let x = tile_load_view_f32(&input);
    let y = safe::tile_rope_f32(x, 0);  // base position = 0
    tile_store_view_f32(&output, y);
}
}

The kernel body is pure safe Rust — shape (rows, cols, dtype) is committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. The #[aiv_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.

Backend status (lowered by rustc_codegen_mlir): Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V (4/9). Ascend AIV / CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO — on those backends RoPE is currently expressed as a buffer-API composition of element-wise cos/sin/mul/add rather than a single fused tile op.

Benchmark configurations

Shape (B, S, D)ElementsBytes (f32)Notes
(1, 64, 128)8K32 KBSingle query, short context
(32, 64, 128)262K1 MBBatched queries
(1, 128, 128)16K64 KBLonger head dim

Results

See Leaderboard filtered to RoPE for the full filterable view.

LayerNorm

Category: Normalization | Complexity: O(N) per row | Memory: 3 passes

Algorithm

3-pass fused: mean, variance, normalize+affine in one workgroup:

  1. Mean: Parallel sum reduction, divide by N
  2. Variance: Parallel sum of (x - mean)^2, compute inverse std
  3. Affine: gamma * (x - mean) * inv_std + beta

Uses SIMD group shuffles for warp-level reductions (1 threadgroup barrier instead of 8).

ascend-rs Kernel Source

LayerNorm in ascend-rs using vectorized AscendC intrinsics (f32, benchmarked implementation):

#![allow(unused)]
fn main() {
#[ascend_std::aiv_kernel]
pub fn layernorm(input: *const f32, output: *mut f32, len_buf: *const u32) {
    let n = *len_buf;
    let eps = 1.0e-5f32;

    let in_buf = ascend_std::ascend_buf_alloc(n);
    let out_buf = ascend_std::ascend_buf_alloc(n);
    let work = ascend_std::ascend_buf_alloc(n);
    let rwork = ascend_std::ascend_buf_alloc(n);

    // DMA load: GM -> local buffer
    ascend_std::ascend_buf_load_f32(in_buf, input, n);
    ascend_std::ascend_pipe_barrier();

    // Step 1: mean = sum(x) / n
    let sum_val = ascend_std::ascend_reduce_sum_f32(work, in_buf, rwork, n);
    let mean = sum_val / (n as f32);

    // Step 2: centered = x - mean
    ascend_std::ascend_adds_f32(out_buf, in_buf, -mean, n);
    ascend_std::ascend_pipe_barrier();

    // Step 3: var = sum((x - mean)^2) / n
    ascend_std::ascend_mul_f32(work, out_buf, out_buf, n);
    ascend_std::ascend_pipe_barrier();
    let var_sum = ascend_std::ascend_reduce_sum_f32(work, work, rwork, n);
    let inv_std = 1.0 / (var_sum / (n as f32) + eps).sqrt();

    // Step 4: output = centered * inv_std
    ascend_std::ascend_muls_f32(out_buf, out_buf, inv_std, n);

    ascend_std::ascend_pipe_barrier();
    ascend_std::ascend_buf_store_f32(output, out_buf, n);
}
}

This buffer-API kernel is the primary implementation and runs on the Ascend AIV backend. A tile-API safe::tile_layernorm_f32 variant is additionally lowered by rustc_codegen_mlir to Apple Metal (1/9) — the other 8 backend lowerings (Ascend AIV / CUDA / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. On non-Metal backends, LayerNorm is currently composed at the buffer API as shown above (mean → sub → mul² → mean → sqrt → mul) rather than emitted as a single tile op.

Benchmark configurations

ShapeNotes
(1, 768)GPT-2 hidden dim, single position
(64, 768)Typical batch
(1024, 768)Large batch

Results

See Leaderboard filtered to LayerNorm for the full filterable view.

RMS Norm

Category: Normalization | Complexity: O(N) per row | Memory: 2 passes

Algorithm

RMSNorm (Zhang & Sennrich 2019) is a simplified LayerNorm used in LLaMA, Gemma, and most modern LLMs. It omits the mean-centering step:

  1. RMS: Compute root-mean-square: rms = sqrt(mean(x²) + ε)
  2. Normalize + Scale: y = (x / rms) * gamma

Compared to LayerNorm, RMSNorm saves one reduction pass (no mean computation) and one elementwise subtraction, yielding ~15% faster inference at equal accuracy.

ascend-rs Kernel Source

RMS Norm using ascend-rs buffer API (f32):

#![allow(unused)]
fn main() {
/// RMS Norm: y[i] = (x[i] / rms) * gamma[i]
/// where rms = sqrt(mean(x²) + eps)
///
/// params: [n: u32]
#[ascend_std::aiv_kernel]
pub fn rms_norm(
    input: *const f32,
    gamma: *const f32,
    output: *mut f32,
    params: *const u32,
) {
    let n = *params;
    let eps = 1.0e-5f32;

    let in_buf = ascend_std::ascend_buf_alloc(n);
    let gamma_buf = ascend_std::ascend_buf_alloc(n);
    let work = ascend_std::ascend_buf_alloc(n);
    let rwork = ascend_std::ascend_buf_alloc(n);

    // Load input and gamma
    ascend_std::ascend_buf_load_f32(in_buf, input, n);
    ascend_std::ascend_buf_load_f32(gamma_buf, gamma, n);
    ascend_std::ascend_pipe_barrier();

    // Step 1: x² → work
    ascend_std::ascend_mul_f32(work, in_buf, in_buf, n);
    ascend_std::ascend_pipe_barrier();

    // Step 2: rms = sqrt(mean(x²) + eps)
    let sq_sum = ascend_std::ascend_reduce_sum_f32(work, work, rwork, n);
    let inv_rms = 1.0 / (sq_sum / (n as f32) + eps).sqrt();

    // Step 3: output = (x * inv_rms) * gamma
    ascend_std::ascend_muls_f32(work, in_buf, inv_rms, n);
    ascend_std::ascend_pipe_barrier();
    ascend_std::ascend_mul_f32(work, work, gamma_buf, n);

    ascend_std::ascend_pipe_barrier();
    ascend_std::ascend_buf_store_f32(output, work, n);
}
}

This buffer-API kernel runs on the Ascend AIV backend. A tile-API safe::tile_rms_norm_f32 variant is additionally lowered by rustc_codegen_mlir to all 9 backends (Ascend AIV, CUDA, Apple Metal, Vulkan SPIR-V, AWS NKI, AMD AIE, Cambricon BANG, Intel Gaudi, Google TPU) — RMS Norm is one of the four “hot path” tile ops (alongside matmul, softmax, silu) that is lowered on every backend currently targeted.

Benchmark configurations

ShapeElementsBytes (f32)Notes
(1, 768)7683 KBGPT-2 hidden dim, single token
(1, 4096)4K16 KBLLaMA-7B hidden dim
(64, 768)49K192 KBTypical batch, GPT-2
(64, 4096)262K1 MBTypical batch, LLaMA
(1024, 4096)4.2M16 MBLarge batch, bandwidth-bound

All benchmarks use f32.

Results

See Leaderboard filtered to RMS Norm for the full filterable view.

GELU

Category: Activation | Complexity: O(N) elementwise | Memory: 1 pass (fused read+write)

Algorithm

GELU (Gaussian Error Linear Unit, Hendrycks & Gimpel 2016) is the standard activation in BERT, GPT, LLaMA, and most transformer models:

GELU(x) = x · Φ(x) = x · 0.5 · (1 + erf(x / √2))

The fast tanh approximation (used in PyTorch gelu(approximate='tanh')):

GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))

GELU is memory-bandwidth bound — the compute-to-byte ratio is low (a few FLOPs per 4-byte element), so peak throughput is measured in GB/s rather than TFLOPS.

ascend-rs Kernel Source

GELU using ascend-rs buffer API (f32, tanh approximation):

#![allow(unused)]
fn main() {
/// GELU activation: y = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
///
/// params: [n: u32]
#[ascend_std::aiv_kernel]
pub fn gelu(
    input: *const f32,
    output: *mut f32,
    params: *const u32,
) {
    let n = *params as usize;
    let sqrt_2_pi: f32 = 0.7978845608; // sqrt(2/pi)
    let coeff: f32 = 0.044715;

    let mut i: usize = 0;
    while i < n {
        let x = *input.wrapping_add(i);
        let x3 = x * x * x;
        let inner = sqrt_2_pi * (x + coeff * x3);
        // tanh via exp: tanh(z) = (e^2z - 1)/(e^2z + 1)
        let e2z = (2.0 * inner).exp();
        let tanh_val = (e2z - 1.0) / (e2z + 1.0);
        *output.wrapping_add(i) = 0.5 * x * (1.0 + tanh_val);
        i += 1;
    }
}
}

Vectorized version using buffer intrinsics:

#![allow(unused)]
fn main() {
#[ascend_std::aiv_kernel]
pub fn gelu_vec(
    input: *const f32,
    output: *mut f32,
    params: *const u32,
) {
    let n = *params;
    let in_buf = ascend_std::ascend_buf_alloc(n);
    let work = ascend_std::ascend_buf_alloc(n);
    let work2 = ascend_std::ascend_buf_alloc(n);

    ascend_std::ascend_buf_load_f32(in_buf, input, n);
    ascend_std::ascend_pipe_barrier();

    // x³
    ascend_std::ascend_mul_f32(work, in_buf, in_buf, n);
    ascend_std::ascend_pipe_barrier();
    ascend_std::ascend_mul_f32(work, work, in_buf, n);
    ascend_std::ascend_pipe_barrier();
    // 0.044715 * x³
    ascend_std::ascend_muls_f32(work, work, 0.044715, n);
    ascend_std::ascend_pipe_barrier();
    // x + 0.044715 * x³
    ascend_std::ascend_add_f32(work, in_buf, work, n);
    ascend_std::ascend_pipe_barrier();
    // sqrt(2/pi) * (x + 0.044715 * x³)
    ascend_std::ascend_muls_f32(work, work, 0.7978845608, n);
    ascend_std::ascend_pipe_barrier();

    // Store result
    ascend_std::ascend_buf_store_f32(output, work, n);
}
}

These buffer-API kernels run on the Ascend AIV backend via rustc_codegen_mlir. No tile-API safe::tile_gelu_f32 currently exists — tile-API lowerings on all 9 backends (Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. Cross-backend execution today goes through the buffer-API scalar loop or the element-wise intrinsic composition shown above.

Benchmark configurations

ShapeElementsBytes (f32)Notes
(1, 768)7683 KBGPT-2 hidden dim
(1, 4096)4K16 KBLLaMA hidden dim
(64, 768)49K192 KBTypical batch
(64, 4096)262K1 MBBandwidth-bound
(1024, 4096)4.2M16 MBLarge batch

All benchmarks use f32.

Results

See Leaderboard filtered to GELU for the full filterable view.

SiLU / Swish

Category: Activation | Complexity: O(N) elementwise | Memory: 1 pass (fused read+write)

Algorithm

SiLU (Sigmoid Linear Unit), also known as Swish (Ramachandran et al. 2017), is the gate activation in LLaMA, Mistral, and most modern LLMs:

SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x))

Used in the FFN block:

FFN(x) = SiLU(W_gate * x) * (W_up * x)

Like GELU, SiLU is memory-bandwidth bound. The compute-to-byte ratio is low (a few FLOPs per element), so throughput is measured in GB/s.

ascend-rs Kernel Source

SiLU using the tile API — safe entry form (lowered by rustc_codegen_mlir to all 9 backends: Ascend AIV, CUDA, Apple Metal, Vulkan SPIR-V, AWS NKI, AMD AIE, Cambricon BANG, Intel Gaudi, Google TPU):

#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[ascend_std::aiv_kernel]
pub fn tile_silu(
    input:  GmView<'_, 1, 4096, f32>,
    output: GmViewMut<'_, 1, 4096, f32>,
) {
    let x = tile_load_view_f32(&input);
    let y = safe::tile_silu_f32(x);
    tile_store_view_f32(&output, y);
}
}

The kernel body is pure safe Rust — shape (rows, cols, dtype) is committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. The #[aiv_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.

safe::tile_silu_f32 decomposes to: neg → exp → add_scalar(1) → reciprocal → mul with original x. SiLU is one of the four “hot path” tile ops (alongside matmul, softmax, rms-norm) that is lowered on every backend currently targeted.

Benchmark configurations

ShapeElementsBytes (f32)Notes
(1, 768)7683 KBGPT-2 hidden dim
(1, 4096)4K16 KBLLaMA hidden dim
(64, 4096)262K1 MBTypical batch
(1024, 4096)4.2M16 MBLarge batch

Results

See Leaderboard filtered to SiLU for the full filterable view.

Embedding Lookup

Category: Memory Access | Complexity: O(N*D) gather | Memory: Random access (bandwidth-bound)

Algorithm

Embedding lookup gathers rows from a (V, D) weight table by token indices:

For each token index t[i] in [0..V):
  output[i, :] = weight[t[i], :]

This is the first operation in any transformer: tokens (integers) become vectors. It is purely bandwidth-bound with random access patterns, making it a key memory subsystem benchmark.

ascend-rs Kernel Source

Embedding using the tile API — safe entry form with one unsafe block (the indices pointer is an integer gather source, not a tile, so safe::tile_embedding_f32 is declared pub unsafe fn):

#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[ascend_std::aiv_kernel]
pub fn tile_embedding(
    weight:  GmView<'_, 32000, 128, f32>,  // (V, D) codebook
    indices: *const u32,                   // (N,) token ids — integer gather source
    output:  GmViewMut<'_, 32, 128, f32>,  // (N, D) gathered rows
) {
    let w = tile_load_view_f32(&weight);
    // SAFETY: `indices` is a valid *const u32 of length COUNT=32, guaranteed by
    // the launcher. The unsafe wrapper is the only non-safe surface.
    let emb = unsafe { safe::tile_embedding_f32(w, indices) };
    tile_store_view_f32(&output, emb);
}
}

Weight table and output shapes are committed at the type level via const generics (V, D, N), so any host-side mismatch becomes a compile-time error. The #[aiv_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 for the tile params so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.

Backend status (lowered by rustc_codegen_mlir): Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V. Ascend AIV / CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO.

Benchmark configurations

Shape (N, V, D)Output ElementsBytes (f32)Notes
(32, 32000, 128)4K16 KBLLaMA-2 vocab, small dim
(128, 32000, 128)16K64 KBLarger batch
(32, 32000, 4096)131K512 KBFull hidden dim

Results

See Leaderboard filtered to Embedding for the full filterable view.

Cross-Entropy Loss

Category: Loss Function | Complexity: O(N*V) reduction | Memory: 2 passes (max + sum-exp)

Algorithm

Cross-entropy loss is the standard training objective for classification and language modeling:

loss[i] = -logits[i, target[i]] + log(sum(exp(logits[i, :])))

Numerically stable version (log-sum-exp trick):

m = max(logits[i, :])
loss[i] = -(logits[i, target[i]] - m) + log(sum(exp(logits[i, :] - m)))

This kernel is compute-heavy for large vocabularies (V=32000+) due to the row-wise exp and reduction. It combines softmax-like reduction with an index gather.

ascend-rs Kernel Source

Cross-entropy using the tile API — safe entry form with one unsafe block (the targets pointer is an integer gather source, not a tile, so safe::tile_cross_entropy_f32 is declared pub unsafe fn):

#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[ascend_std::aiv_kernel]
pub fn tile_cross_entropy(
    logits:  GmView<'_, 32, 32000, f32>,  // (N, V)
    targets: *const u32,                  // (N,) target class ids — integer gather
    loss:    GmViewMut<'_, 32, 1, f32>,   // (N, 1) per-row loss
) {
    let x = tile_load_view_f32(&logits);
    // SAFETY: `targets` is a valid *const u32 of length R=32, guaranteed by
    // the launcher. The unsafe wrapper is the only non-safe surface.
    let y = unsafe { safe::tile_cross_entropy_f32(x, targets) };
    tile_store_view_f32(&loss, y);
}
}

Logits and loss shapes (and their shared N) are committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. The #[aiv_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 for the tile params so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.

Backend status (lowered by rustc_codegen_mlir): Cambricon BANG, Intel Gaudi. Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Google TPU lowerings are TODO — this is the narrowest backend coverage of any kernel page, reflecting that cross-entropy is primarily a training-loss primitive.

Benchmark configurations

Shape (N, V)ElementsBytes (f32)Notes
(32, 32000)1M4 MBLLaMA-2 vocab, small batch
(128, 32000)4M16 MBLarger batch
(32, 50257)1.6M6.4 MBGPT-2 vocab

Results

See Leaderboard filtered to Cross-Entropy for the full filterable view.

Dilated Conv1D + ReLU

Category: Convolution | Complexity: O(B·T·C²·3) | Fusion: pad + gather + matmul + ReLU

Algorithm

A dilated 1D convolution with kernel size 3, fused with bias and ReLU activation. Used in VQ-VAE encoder/decoder ResConv1DBlocks (e.g., SOKE, Jukebox-style models).

The naive implementation requires:

  1. Pad the input by dilation on each side (zero-padding)
  2. Gather 3 positions per output: [t-d, t, t+d]
  3. Concat along the channel axis -> (B, T, 3C)
  4. Linear projection (3C -> C)
  5. ReLU

This kernel fuses all 5 steps into a single GPU pass, eliminating three intermediate (B, T, 3C) buffer allocations and the data shuffles between them.

Why fusion matters

For an 18-block VQ-VAE encoder/decoder, the unfused version allocates 54 intermediate tensors per forward pass and reads them back. Fusing into one kernel:

  • Eliminates intermediate buffer writes/reads (3x memory bandwidth reduction)
  • Keeps activations in registers/L1 cache between stages
  • One command buffer dispatch instead of five

ascend-rs Kernel Source

Vectorized dilated conv1d + ReLU using ascend-rs buffer API (f32, benchmarked implementation):

#![allow(unused)]
fn main() {
#[ascend_std::aiv_kernel]
pub fn conv1d_dilated(input: *const f32, output: *mut f32, params: *const u32) {
    let n = *params;
    let dilation = *params.wrapping_add(1);
    let w0 = f32::from_bits(*params.wrapping_add(2));
    let w1 = f32::from_bits(*params.wrapping_add(3));
    let w2 = f32::from_bits(*params.wrapping_add(4));
    let bias = f32::from_bits(*params.wrapping_add(5));

    let aligned_n = ((n + 7) / 8) * 8;
    let in_buf = ascend_std::ascend_buf_alloc(aligned_n);
    let tap_left = ascend_std::ascend_buf_alloc(aligned_n);
    let tap_right = ascend_std::ascend_buf_alloc(aligned_n);
    let acc = ascend_std::ascend_buf_alloc(aligned_n);
    let work = ascend_std::ascend_buf_alloc(aligned_n);

    ascend_std::ascend_buf_load_f32(in_buf, input, n);
    ascend_std::ascend_pipe_barrier();

    // Build shifted tap buffers with zero-padding
    ascend_std::ascend_buf_fill_f32(tap_left, 0.0, aligned_n);
    let mut i = dilation;
    while i < n {
        let v = ascend_std::ascend_get_value_f32(in_buf, i - dilation);
        ascend_std::ascend_set_value_f32(tap_left, i, v);
        i += 1;
    }
    ascend_std::ascend_buf_fill_f32(tap_right, 0.0, aligned_n);
    i = 0;
    while i + dilation < n {
        let v = ascend_std::ascend_get_value_f32(in_buf, i + dilation);
        ascend_std::ascend_set_value_f32(tap_right, i, v);
        i += 1;
    }

    // Vector MAC: acc = tap_left*w0 + input*w1 + tap_right*w2 + bias
    ascend_std::ascend_muls_f32(acc, tap_left, w0, n);
    ascend_std::ascend_muls_f32(work, in_buf, w1, n);
    ascend_std::ascend_add_f32(tap_left, acc, work, n);
    ascend_std::ascend_muls_f32(work, tap_right, w2, n);
    ascend_std::ascend_add_f32(acc, tap_left, work, n);
    ascend_std::ascend_adds_f32(acc, acc, bias, n);
    ascend_std::ascend_maxs_f32(acc, acc, 0.0, n);  // ReLU

    ascend_std::ascend_pipe_barrier();
    ascend_std::ascend_buf_store_f32(output, acc, n);
}
}

This buffer-API kernel runs on the Ascend AIV backend via rustc_codegen_mlir. No tile-API safe::tile_conv1d_f32 currently exists — tile-API lowerings on all 9 backends (Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. On non-Ascend backends the fused pad+gather+matmul+ReLU is currently expressed as a buffer-API composition rather than a single tile op.

Benchmark configurations

Shape (B, T, C)ElementsNotes
(2, 50, 512)51 KSingle VQ-VAE block, small batch
(8, 100, 512)410 KMid-sized clip
(2, 400, 512)410 KLong sequence

Results

See Leaderboard filtered to conv1d-dilated for the full filterable view.

VQ Quantize + EMA Update

Category: Quantization | Complexity: O(N·K·D) | Fusion: L2 distance + argmin + scatter-add

Algorithm

Vector quantization maps each input vector to its nearest codebook entry, then updates the codebook via exponential moving average (EMA). Used in VQ-VAE training (SOKE, Jukebox, SoundStream).

Pipeline:

  1. L2 distance: For each input vector x[i] (dim D), compute ||x[i] - c[k]||² against all K codebook entries
  2. Argmin: Find nearest codebook entry k* = argmin_k ||x[i] - c[k]||²
  3. Quantize: Output q[i] = c[k*] (the nearest codebook vector)
  4. EMA scatter-add: Accumulate x[i] into codebook slot k* for EMA update: sum[k*] += x[i], count[k*] += 1

Fusing all 4 steps into one kernel eliminates 3 intermediate buffers (distance matrix, index array, scatter workspace).

ascend-rs Kernel Source

VQ quantize kernel using ascend-rs buffer API (f32):

#![allow(unused)]
fn main() {
/// VQ Quantize: for each input vector, find nearest codebook entry (L2),
/// output the quantized vector, and scatter-add for EMA codebook update.
///
/// params: [n_vectors: u32, n_codes: u32, dim: u32]
#[ascend_std::aiv_kernel]
pub fn vq_quantize(
    input: *const f32,      // (N, D) input vectors
    codebook: *const f32,   // (K, D) codebook
    output: *mut f32,       // (N, D) quantized output
    cb_sum: *mut f32,       // (K, D) EMA numerator accumulator
    cb_count: *mut u32,     // (K,)   EMA denominator counter
    params: *const u32,
) {
    let n = *params;                      // number of input vectors
    let k = *params.wrapping_add(1);      // codebook size
    let d = *params.wrapping_add(2);      // vector dimension

    let buf_x = ascend_std::ascend_buf_alloc(d);     // current input vector
    let buf_c = ascend_std::ascend_buf_alloc(d);     // current codebook entry
    let buf_diff = ascend_std::ascend_buf_alloc(d);  // x - c
    let buf_work = ascend_std::ascend_buf_alloc(d);
    let buf_rwork = ascend_std::ascend_buf_alloc(d);

    let mut i: u32 = 0;
    while i < n {
        // Load input vector x[i]
        let x_ptr = input.wrapping_add((i * d) as usize);
        ascend_std::ascend_buf_load_f32(buf_x, x_ptr, d);
        ascend_std::ascend_pipe_barrier();

        // Find nearest codebook entry (L2 argmin)
        let mut best_k: u32 = 0;
        let mut best_dist: f32 = f32::MAX;

        let mut j: u32 = 0;
        while j < k {
            let c_ptr = codebook.wrapping_add((j * d) as usize);
            ascend_std::ascend_buf_load_f32(buf_c, c_ptr, d);
            ascend_std::ascend_pipe_barrier();

            // diff = x - c
            ascend_std::ascend_sub_f32(buf_diff, buf_x, buf_c, d);
            ascend_std::ascend_pipe_barrier();
            // diff² = diff * diff
            ascend_std::ascend_mul_f32(buf_diff, buf_diff, buf_diff, d);
            ascend_std::ascend_pipe_barrier();
            // dist = sum(diff²)
            let dist = ascend_std::ascend_reduce_sum_f32(
                buf_work, buf_diff, buf_rwork, d);

            if dist < best_dist {
                best_dist = dist;
                best_k = j;
            }
            j += 1;
        }

        // Output: quantized = codebook[best_k]
        let best_ptr = codebook.wrapping_add((best_k * d) as usize);
        ascend_std::ascend_buf_load_f32(buf_c, best_ptr, d);
        ascend_std::ascend_pipe_barrier();
        let out_ptr = output.wrapping_add((i * d) as usize);
        ascend_std::ascend_buf_store_f32(out_ptr, buf_c, d);

        // EMA scatter-add: cb_sum[best_k] += x[i], cb_count[best_k] += 1
        let sum_ptr = cb_sum.wrapping_add((best_k * d) as usize);
        let sum_buf = ascend_std::ascend_buf_alloc(d);
        ascend_std::ascend_buf_load_f32(sum_buf, sum_ptr, d);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_add_f32(sum_buf, sum_buf, buf_x, d);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(sum_ptr, sum_buf, d);

        let count_val = *cb_count.wrapping_add(best_k as usize);
        *cb_count.wrapping_add(best_k as usize) = count_val + 1;

        i += 1;
    }
}
}

This buffer-API kernel runs on the Ascend AIV backend via rustc_codegen_mlir, and avoids materializing the N×K distance matrix and K-element index array. No tile-API safe::tile_vq_quantize_f32 currently exists — tile-API lowerings on all 9 backends (Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. Cross-backend VQ today uses vendor kernels (aclnnMatmul, MPS GEMM, torch.cdist) with a separate argmin pass rather than the fused Rust kernel shown above.

Benchmark configurations

Shape (N, K, D)FLOPsNotes
(256, 512, 64)16.8 MSmall codebook, low-latency inference
(1024, 512, 64)67.1 MTypical VQ-VAE batch
(1024, 1024, 128)268 MLarge codebook, high-dim embeddings
(4096, 512, 64)268 MLarge batch training

All benchmarks use f32.

Results

DeviceShape (N, K, D)Latency (μs)GFLOPSNotes
Ascend 910B(4096, 1024, 128)9411,411aclnnMatmul L2 trick
Ascend 910B(1024, 1024, 128)318,604Large codebook
Ascend 910B(4096, 512, 64)436,243Large batch
Apple M2 Max(4096, 1024, 128)6461,662MPS GEMM + CPU argmin
Apple M2 Max(8192, 512, 64)4501,193Large batch
Tesla T4(4096, 1024, 128)1,163923torch.cdist + argmin
Tesla T4(1024, 1024, 128)494544torch.cdist + argmin
Tesla T4(4096, 512, 64)624430torch.cdist + argmin

Peak: 11.4 TFLOPS on Ascend 910B (cube engine via L2 distance matmul trick). Apple M2 Max peaks at 1.7 TFLOPS via MPS. Tesla T4 peaks at 923 GFLOPS via torch.cdist.

See Leaderboard filtered to VQ Quantize for the full filterable view.

Financial Sidecar

Real-time context for xPU investment and procurement decisions.

Stock prices (AI chip vendors)

TickerCompanyRole
NVDANVIDIAGPU market leader
AMDAMDMI300X, CDNA competitor
AAPLAppleM-series, Metal ecosystem
INTCIntelGaudi, Habana
GOOGGoogleTPU, custom silicon
AMZNAmazonTrainium, Inferentia

Device street prices

Tracking real-world prices (not MSRP) helps compute true cost-effectiveness:

DeviceMSRPStreet PriceSource
NVIDIA H100 SXM$30,000Check latesteBay, broker
NVIDIA A100 80GB$10,000Check latesteBay, broker
AMD MI300X$15,000Check latestAMD direct
Apple M4 Max (laptop)$3,999Check latestApple Store

Commodity reference

SymbolRelevance
Gold (XAU)Store-of-value benchmark
Oil (WTI)Energy cost proxy
BTCCrypto mining demand affects GPU pricing
USD/CNYHuawei/Cambricon pricing

Price data updated weekly via scripts/fetch_prices.py.

Submit Results

CSV format

Create a CSV file named <device-slug>.csv with these columns:

device_id,kernel_id,dtype,input_shape,batch_size,impl_lang,latency_us,driver_version,toolchain,git_sha,submitter
nvidia-h100-sxm,softmax,f32,"[64, 1024]",1,cuda,12.3,CUDA 12.4,nvcc 12.4,abc1234,your-name

Steps

  1. Fork the pu-rs.org repo
  2. Add your CSV to submissions/
  3. Open a pull request
  4. CI validates format and sanity checks
  5. Maintainers review and merge

Requirements

  • Minimum 20 runs per (kernel, shape) pair
  • Report median latency
  • Include driver version and toolchain
  • Device must exist in db/seed_devices.sql (or add it in the same PR)

Running the benchmark

All benchmark scripts live in this repo under scripts/.

# Metal (Apple Silicon)
# Requires: ascend_metal_kernels Python module
#   (build: cd ascend-rs/crates/ascend_metal_py && maturin develop --release)
ASCEND_METAL_KERNELS=1 python3 scripts/bench_metal.py --device apple-m2-max-38
ASCEND_METAL_KERNELS=1 python3 scripts/bench_metal.py --device apple-m4-max-40 -o submissions/m4-max.csv

# Ascend NPU (Huawei 910B/910C)
# Requires: CANN SDK + ascend-rs repo cloned locally
bash scripts/bench_ascend.sh --device huawei-910b
bash scripts/bench_ascend.sh --device huawei-910c --only softmax --ascend-rs ~/ascend-rs

Supported backends

BackendScriptPrerequisites
Apple Metalscripts/bench_metal.pyascend_metal_kernels Python module (build instructions)
Huawei Ascendscripts/bench_ascend.shCANN SDK + ascend-rs repo
NVIDIA CUDAscripts/bench_cuda.pyPlanned
AMD ROCmscripts/bench_rocm.pyPlanned