pu-rs.org – Processing Unit Ranking System

The SPECfp for AI accelerators.

FLOPS don’t tell the full story. A chip rated at 1000 TFLOPS means nothing if your softmax kernel only achieves 5% utilization. pu-rs.org measures what matters: actual kernel execution time on real hardware, for the operations that AI workloads actually run.

Why this exists

What we measure	What others report
Softmax latency at (64, 4096) f16	Peak TFLOPS
LayerNorm throughput per watt	Memory bandwidth (theoretical)
MatMul efficiency vs roofline	Marketing benchmarks
Cost per real GOPS	Cloud $/hour (opaque)

Scope

We benchmark the kernel primitives that compose every AI model:

Category	Kernels
Activation	Softmax, GELU, SiLU
Normalization	LayerNorm, RMSNorm
Linear Algebra	GEMM, batched MatMul
Attention	Scaled Dot-Product Attention
Quantization	VQ-Quantize, INT8 dequant
Convolution	Conv1D, dilated Conv1D
Reduction	Scatter-add, L1-smooth loss

Devices covered

Type	Vendors
GPU	NVIDIA (A100, H100, H200, B200), AMD (MI300X), Apple (M2/M4 Max), Cambricon
TPU	Google (v5e, v6e Trillium)
NPU	Huawei Ascend (910B, 910C), AWS Trainium2, Intel Gaudi 3

How it works

Run standardized benchmark scripts on your hardware
Submit CSV results via pull request
CI validates format and sanity checks
Leaderboard updates automatically with per-kernel rankings

All results are tagged with git SHA, driver version, toolchain, and number of runs. Median latency is reported. Full methodology.

End-to-end complement

Per-kernel latency is only half the story — a chip can win on softmax and still lose on a real model. The DeepSeek decode page reports end-to-end throughput across five accelerators (Ascend 910B2, TPU v2-8, Apple M2 Max, NVIDIA T4, AWS Trainium1) from the same 13-kernel Rust source emitted through the tile-rs MLIR backends.

Built with tile-rs kernel infrastructure (the generic multi-backend codegen framework; the Huawei Ascend NPU support is the tile-rs-ascend backend). Data updated weekly.

Leaderboard

#	Vendor	Device	Type	Kernel	Dtype	Shape	Latency (us)	GOPS	GOPS/$	GOPS/W	Verified

DeepSeek Decode Throughput (Cross-Vendor)

Per-kernel latency (the rest of this site) is the microbenchmark view. This page reports the complementary end-to-end decode throughput for a full DeepSeek-R1-Distill-Qwen-1.5B decode step, measured on five production accelerators from the same 13-kernel Rust source emitted through tile-rs’s per-vendor MLIR backends.

Headline

Rank	Device	Backend	Emitted language	Decode tok/s
1	Huawei Ascend 910B2	`mlir_to_cpp`+`mlir_to_pto`	AscendC C++ + PTO-MLIR	168.9
2	Google TPU v2-8 (Colab)	`mlir_to_tpu`	Pallas	162.9
3	Apple M2 Max	`mlir_to_msl`	Metal	91.7
4	NVIDIA T4 (Colab)	`mlir_to_gpu`	CUDA	53.7
5	AWS Trainium1 (`trn1.2xlarge`)	`mlir_to_nki`	NKI Python	12.2
–	CPU reference	–	plain Rust	3.7

All five numbers come from identical Rust kernel sources. The MLIR → vendor backend is the only thing that changes between rows. 168.9 tok/s on 910B2 is 2.47× the aclnn-only baseline and 45.6× the CPU reference.

Why this complements the kernel leaderboard

The kernel leaderboard tells you which chip runs a given softmax or GEMM fastest in isolation. Decode tok/s tells you what a real inference workload actually achieves once those kernels are composed with host-side launch overhead, KV cache traffic, and HBM pressure. A chip can win per-kernel and still lose on decode (the Trainium row is the clearest example — strong per-op latency, 9.5% bandwidth utilisation end-to-end).

Notes per device

Ascend 910B2 — the +pto half of the joint path contributes the four decode matmul shapes (1.75×–2.98× vs aclnn). RMSNorm stays on CPU; every other op is on-NPU. See the main tile-rs blog ch10 for the per-kernel breakdown.
TPU v2-8 — Colab-visible Pallas on a 4-chip v2 pod; the emitted kernel set is rms_norm, matvec_f16, and an attention fusion.
Apple M2 Max — emitted Metal beats Apple’s hand-tuned MLX on decode at this model size. Measured via deepseek_metal.
NVIDIA T4 — Colab Tesla T4; same three kernels as the TPU row, emitted in CUDA. 53.7 tok/s is below-roofline (the T4 is memory-bandwidth-starved for this model shape; its memory is GDDR6, not HBM).
AWS Trainium1 — trn1.2xlarge. Six emitted NKI kernels (rms_norm, three matvec_f16 variants, gate_up_silu). Traced via torch_neuronx.trace in two halves (eager single-NEFF runs at 2.5 tok/s — 5× slower — because the single compile unit can’t pipeline across the whole decode path). Trace time: 461 s; wall time: 5.23 s for 64 decode steps.

Reproduction

Each number is reproducible with the commands documented in the per-device sections of tile-rs ch10. The source kernels live at crates/tile_std/src/tile.rs in the public repo; the per-vendor emitters live in crates/rustc_codegen_tile/src/mlir_to_*.rs.

The per-kernel leaderboard remains the authoritative view for isolated-kernel efficiency. Decode throughput is reported here as the complementary end-to-end metric.

Cost Effectiveness

The most important metric for deployment decisions: how much real performance do you get per dollar and per watt?

#	Device	Kernel	Latency (us)	MSRP ($)	TDP (W)	GOPS/$	GOPS/W

Methodology

Measurement protocol

Warmup: 50 iterations discarded
Measurement: 500 iterations, median latency reported
Amortization: Dispatch overhead amortized by batching 500 kernel launches into one command buffer where supported
Isolation: Benchmarks run on idle systems, no background GPU workloads
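The protocol above can be sketched on the host side as follows. This is a minimal illustration, not the site's actual harness: the names `median` and `median_latency_us` are hypothetical, and it times a host closure with `std::time::Instant`, whereas a real kernel benchmark would read device-side timers.

```rust
use std::time::Instant;

/// Median of a sample set; for an even count (such as 500 iterations),
/// this is the mean of the two middle values.
fn median(samples: &mut [f64]) -> f64 {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort_by(|a, b| a.partial_cmp(b).expect("latencies must not be NaN"));
    let mid = samples.len() / 2;
    if samples.len() % 2 == 0 {
        (samples[mid - 1] + samples[mid]) / 2.0
    } else {
        samples[mid]
    }
}

/// Runs `kernel` `warmup` times (results discarded), then `iters` timed
/// times, and returns the median latency in microseconds.
/// Assumption: `kernel` blocks until the work is finished.
fn median_latency_us<F: FnMut()>(mut kernel: F, warmup: usize, iters: usize) -> f64 {
    for _ in 0..warmup {
        kernel();
    }
    let mut samples = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        kernel();
        samples.push(start.elapsed().as_secs_f64() * 1e6);
    }
    median(&mut samples)
}
```

With the documented settings this would be called as `median_latency_us(run_kernel, 50, 500)`, where `run_kernel` is whatever closure dispatches the kernel and waits for it.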

What we measure

Kernel-only time: the GPU/NPU execution time for a single kernel dispatch, excluding:

Host-to-device data transfer (data assumed resident)
Command buffer creation overhead (amortized)
Python/framework overhead

This isolates the hardware+compiler efficiency from the software stack.

Reporting

Metric	Definition
Latency (us)	Median kernel execution time in microseconds
GOPS	Throughput in giga-operations per second: operations / latency in seconds / 10⁹
GOPS/$	Throughput / device MSRP in USD
GOPS/W	Throughput / TDP in watts
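As a worked illustration of these definitions, the sketch below derives the three throughput metrics from an operation count and a latency in microseconds. The function names are hypothetical, and the 70 W figure used in the note afterwards is an assumed TDP for the Tesla T4 rather than a value stated in this table.

```rust
/// Throughput in GOPS from an operation count and a latency in microseconds.
fn gops(operations: f64, latency_us: f64) -> f64 {
    let seconds = latency_us * 1e-6;
    operations / seconds / 1e9
}

/// Throughput per dollar of device MSRP.
fn gops_per_dollar(gops: f64, msrp_usd: f64) -> f64 {
    gops / msrp_usd
}

/// Throughput per watt of TDP.
fn gops_per_watt(gops: f64, tdp_watts: f64) -> f64 {
    gops / tdp_watts
}
```

For a 4096 × 4096 × 4096 GEMM (2 · 4096³ operations) measured at 5,698 μs, `gops` gives roughly 24,120 GOPS (24.1 TFLOPS). Dividing by an assumed 70 W TDP gives about 345 GOPS/W.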

Standardized configurations

Each kernel is benchmarked at these canonical shapes:

Kernel	Shapes	Dtypes
Softmax	(1,1024), (64,1024), (64,4096)	f32, f16
LayerNorm	(1,768), (64,768), (1024,768)	f32, f16
GEMM	(1024,1024,1024), (4096,4096,4096)	f32, f16, bf16
Attention	(1,32,128,128), (32,32,2048,128)	f32, f16

How to submit

See Submit Results.

GEMM (MatMul)

Category: Linear Algebra | Complexity: O(M·K·N) | Compute: Cube/Tensor Core bound

Algorithm

Dense matrix multiplication: C[M×N] = A[M×K] × B[K×N].

The fundamental ML primitive — dominates runtime in transformers (linear projections, attention scores, FFN layers). Performance depends on tiling strategy, memory hierarchy utilization, and hardware matrix units (cube engines, tensor cores).
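For checking kernel output on small shapes, a naive CPU reference is usually enough. The sketch below is such a reference, assuming row-major storage. `gemm_ref` and `gemm_flops` are illustrative names, not part of tile-rs.

```rust
/// Naive row-major GEMM: C[M×N] = A[M×K] × B[K×N].
fn gemm_ref(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            // Hoist A[i, p] so the inner loop streams over one row of B and C.
            let a_ip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    c
}

/// FLOP count: one multiply and one add per (i, j, p) triple.
fn gemm_flops(m: u64, k: u64, n: u64) -> u64 {
    2 * m * k * n
}
```

The 2·M·K·N count is the convention behind FLOPs figures for GEMM, for example 2 · 4096³ ≈ 137 G for the 4096-square case.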

tile-rs Kernel Source

Matrix multiplication in ascend-rs uses the tile API, which compiles to hardware-specific matmul units (cube engine on Ascend, tensor cores on CUDA, etc.).

Safe entry form — kernel body is pure safe Rust, shape committed at the type level:

#![allow(unused)]
fn main() {
use tile_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

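// M, K and N stand for compile-time shape constants; their declarations are omitted in this excerpt.
#[tile_std::tile_kernel]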
pub fn tile_matmul(
    a:      GmView<'_, M, K, f32>,
    b:      GmView<'_, K, N, f32>,
    output: GmViewMut<'_, M, N, f32>,
) {
    let at = tile_load_view_f32(&a);
    let bt = tile_load_view_f32(&b);
    let c  = safe::tile_matmul_f32(at, bt);
    tile_store_view_f32(&output, c);
}
}

No unsafe blocks in the body. A mismatched K between the two operands is a compile-time error. The #[tile_kernel] attribute rewrites the emitted signature to raw *const f32 / *mut f32 so the launcher toolchain sees an unchanged C ABI — #[repr(transparent)] makes the rewrite free at the LLVM IR level.

This compiles via rustc_codegen_tile → MLIR → target code on all 9 backends:

Ascend AIV: PTO-MLIR pto.tmatmul → cube engine (320 TFLOPS f16 on 910B)
CUDA: __shared__ tiled GEMM with __syncthreads()
Apple Metal / Vulkan SPIR-V: compute shader with shared-memory tiling
AWS NKI (Trainium): nki.isa.nc_matmul
AMD AIE: AIE2P cascade matmul
Cambricon BANG: __bang_matmul on MLU tensor units
Intel Gaudi: HPU matmul intrinsic
Google TPU: XLA dot_general via OpenXLA

For benchmarking, vendor-optimized libraries are used: aclnnMatmul (Ascend), cuBLAS (CUDA), MPSMatrixMultiplication (Metal).

Benchmark configurations

Shape (A × B)	FLOPs	Notes
[1024, 1024] × [1024, 1024]	2.1 G	Small square, tests dispatch overhead
[4096, 4096] × [4096, 4096]	137 G	Standard benchmark, bandwidth→compute transition
[8192, 8192] × [8192, 8192]	1.1 T	Large square, saturates compute units
[16384, 16384] × [16384, 16384]	8.8 T	Full hardware saturation
[1024, 4096] × [4096, 1024]	8.6 G	Rectangular, typical FFN down-projection
[4096, 1024] × [1024, 4096]	34.4 G	Rectangular, typical FFN up-projection
[2048, 8192] × [8192, 2048]	68.7 G	Transformer-scale attention projection

All benchmarks use f16 inputs and f16 outputs, with f32 accumulation where supported.

Results

Device	Shape	Latency (μs)	TFLOPS	GOPS/W
Ascend 910B	[4096²]×[4096²]	437	314.5	1014
Ascend 910B	[8192²]×[8192²]	3,614	304.2	981
Ascend 910B	[16384²]×[16384²]	27,467	320.2	1033
Ascend 910B	[2048, 8192]×[8192, 2048]	245	280.0	903
Ascend 910B	[4096, 1024]×[1024, 4096]	132	260.1	839
Apple M2 Max	[4096²]×[4096²]	17,374	7.9	53
Apple M2 Max	[8192²]×[8192²]	139,596	7.9	53
Apple M2 Max	[2048, 8192]×[8192, 2048]	8,972	7.7	51
Apple M2 Max	[4096, 1024]×[1024, 4096]	4,345	7.9	53
Tesla T4	[4096²]×[4096²]	5,698	24.1	345
Tesla T4	[8192²]×[8192²]	44,099	24.9	356
Tesla T4	[2048, 8192]×[8192, 2048]	2,567	26.8	383
Tesla T4	[4096, 1024]×[1024, 4096]	1,549	22.2	317

Peak: 320 TFLOPS (f16) on Ascend 910B — saturating the theoretical maximum. Tesla T4 peaks at 26.8 TFLOPS (f16) via cuBLAS (torch.matmul). Apple M2 Max peaks at 7.9 TFLOPS (f16) via MPSMatrixMultiplication.

See Leaderboard filtered to MatMul for the full filterable view.

Attention (Scaled Dot-Product)

Category: Attention | Complexity: O(B·H·S²·D) | Compute: Cube/Tensor Core bound

Algorithm

Scaled dot-product attention: Output = softmax(Q·K^T / √d) · V

The core transformer primitive: it computes attention weights from queries and keys, applies softmax normalization, then produces a weighted sum of values. Because its cost grows quadratically with sequence length, it dominates runtime at long context in transformer architectures (GPT, BERT, LLaMA, etc.).

Pipeline:

Scores = Q × K^T — matmul (S×D) × (D×S) → (S×S)
Scale by 1/√d — element-wise multiply
Softmax along last axis — numerically stable (max → sub → exp → sum → div)
Output = Weights × V — matmul (S×S) × (S×D) → (S×D)
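The four pipeline steps can be written as a small CPU reference, useful for validating device output on tiny shapes. This is a sketch under stated assumptions: single head, row-major f32, no masking. `attention_ref` is a hypothetical name, not a tile-rs function.

```rust
/// Single-head scaled dot-product attention on row-major f32 slices.
/// `q`, `k`, `v` are (S × D); the result is (S × D).
fn attention_ref(q: &[f32], k: &[f32], v: &[f32], s: usize, d: usize) -> Vec<f32> {
    assert!(q.len() == s * d && k.len() == s * d && v.len() == s * d);
    let scale = 1.0 / (d as f32).sqrt();
    let mut out = vec![0.0f32; s * d];
    let mut row = vec![0.0f32; s];
    for i in 0..s {
        // Steps 1 and 2: scores[i, j] = (q_i · k_j) / sqrt(d)
        for j in 0..s {
            let dot: f32 = (0..d).map(|t| q[i * d + t] * k[j * d + t]).sum();
            row[j] = dot * scale;
        }
        // Step 3: numerically stable softmax (max, sub, exp, sum, div)
        let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0f32;
        for w in row.iter_mut() {
            *w = (*w - max).exp();
            sum += *w;
        }
        // Step 4: out_i = sum over j of weights[i, j] * v_j
        for j in 0..s {
            let w = row[j] / sum;
            for t in 0..d {
                out[i * d + t] += w * v[j * d + t];
            }
        }
    }
    out
}
```

A quick sanity check: with an all-zero Q every score is zero, the softmax weights are uniform, and each output row is the mean of the rows of V.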

tile-rs Kernel Source

The attention pipeline in ascend-rs combines tile-API matmul with custom Rust kernels for scale and softmax:

#![allow(unused)]
fn main() {
use ascend_rs::prelude::*;

let scale = 1.0f32 / (d_k as f32).sqrt();

// Step 1: scores = Q × K^T  (HGEMM via cube engine)
acl_blas_hgemm(TransN, TransT, TransN,
    seq_len, seq_len, d_k,
    &alpha, &d_q, d_k, &d_k_mat, d_k,
    &beta, &mut d_scores, seq_len,
    HighPrecision, &stream)?;

// Step 2: scores *= 1/√d_k  (custom Rust kernel → NPU)
scale_kernel.launch(1, &stream, &mut [
    d_scores.as_mut_ptr(),  // in-place
    d_scores.as_mut_ptr(),  // output (same buffer)
    d_n_scores.as_mut_ptr(),
    d_scale.as_mut_ptr(),
])?;

// Step 3: weights = softmax(scores)  (custom Rust kernel → NPU)
softmax_kernel.launch(1, &stream, &mut [
    d_scores.as_mut_ptr(),
    d_weights.as_mut_ptr(),
    d_row_len.as_mut_ptr(),
    d_num_rows.as_mut_ptr(),
])?;

// Step 4: output = weights × V  (HGEMM via cube engine)
acl_blas_hgemm(TransN, TransN, TransN,
    seq_len, d_k, seq_len,
    &alpha, &d_weights, seq_len, &d_v, d_k,
    &beta, &mut d_output, d_k,
    HighPrecision, &stream)?;
}

The scale and softmax kernels are written in Rust and compiled via rustc_codegen_tile → MLIR → backend code. The GEMMs use vendor-optimized libraries (aclnnMatmul, cuBLAS, MPSMatrixMultiplication).

Backend status for the fused safe::tile_attention_f32 tile op: Ascend AIV, Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V (5/9). CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO — on those backends the pipeline still runs as separate matmul + softmax + matmul dispatches using the individually-lowered tile ops.

Benchmark configurations

Shape (B, H, S, D)	FLOPs	Notes
(1, 1, 128, 64)	4.2 M	Small baseline, dispatch overhead test
(1, 1, 512, 64)	67 M	Medium sequence
(1, 1, 1024, 64)	268 M	GPT-2 scale
(1, 1, 2048, 64)	1.1 G	Long context
(1, 1, 4096, 64)	4.3 G	Very long context (quadratic scaling)
(1, 8, 512, 64)	537 M	8-head, GPT-2 like
(1, 12, 512, 64)	805 M	12-head, BERT-base
(1, 32, 512, 64)	2.1 G	32-head, LLaMA-7B
(1, 32, 1024, 128)	17.2 G	32-head, LLaMA-2-7B
(1, 32, 2048, 128)	68.7 G	32-head, long context

All benchmarks use f16 input with f16 output. FLOPs ≈ 4·B·H·S²·D (two matmuls dominate).
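The FLOPs column and the TFLOPS figures derived from it follow directly from that approximation. A minimal sketch, with hypothetical helper names:

```rust
/// Approximate attention FLOPs: two matmuls of 2·S²·D each, per batch and head.
/// The scale and softmax steps are ignored, as in the approximation above.
fn attention_flops(b: u64, h: u64, s: u64, d: u64) -> u64 {
    4 * b * h * s * s * d
}

/// Achieved TFLOPS for a given FLOP count and latency in microseconds.
fn tflops(flops: u64, latency_us: f64) -> f64 {
    flops as f64 / (latency_us * 1e-6) / 1e12
}
```

For (1, 32, 1024, 128) this gives about 17.2 G FLOPs; at a latency of 310 μs that corresponds to roughly 55.4 TFLOPS.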

Results

Device	Shape (B,H,S,D)	Latency (μs)	TFLOPS	Notes
Ascend 910B	(1,32,1024,128)	310	55.4	aclnnMatmul+Softmax, manual pipeline
Ascend 910B	(1,32,2048,128)	1,459	47.1	Memory-bound at long context
Ascend 910B	(1,1,4096,64)	149	28.9	Single-head, large S
Ascend 910B	(1,32,512,64)	105	20.4	32-head, short context
Tesla T4	(1,32,1024,128)	2,609	6.6	F.scaled_dot_product_attention
Tesla T4	(1,32,2048,128)	5,067	13.6	Flash attention backend
Tesla T4	(1,1,4096,64)	974	4.4	Single-head
Tesla T4	(1,32,512,64)	427	5.0	32-head
Apple M2 Max	(1,32,1024,128)	138,819	0.12	MPS GEMM + CPU softmax
Apple M2 Max	(1,1,4096,64)	60,647	0.07	MPS GEMM + CPU softmax

Peak: 55.4 TFLOPS (f16) on Ascend 910B. Tesla T4 peaks at 13.6 TFLOPS (f16) via PyTorch SDPA. Apple M2 Max peaks at 0.12 TFLOPS in the rows above, bottlenecked by CPU softmax (no fused MPS attention).

See Leaderboard filtered to Attention for the full filterable view.

Softmax

Category: Activation | Complexity: O(N) per row | Memory: 2 passes over input

Algorithm

The online 2-pass softmax (Milakov & Gimelshein 2018):

Pass 1 (single traversal): Maintain running (max, sum) pair per thread. When a new maximum is found, rescale the accumulated sum:

sum_new = sum_old * exp(max_old - max_new) + exp(x - max_new)

Pass 2: Write exp(x - global_max) / global_sum per element.

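Relative to the naive 3-pass algorithm (max, exp+sum, normalize), in which each pass re-reads the input, this cuts input reads from three to two, i.e. 33% fewer loads. Counting the final store as well, memory accesses per element drop from four to three.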
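The running (max, sum) update can be sketched for a single row on the CPU as below. This is an illustrative single-threaded version, assuming finite inputs. The per-thread reduction and cross-thread merge of a real kernel are not shown, and `online_softmax` is a hypothetical name.

```rust
/// Online softmax over one row: one traversal for (max, sum), one to write.
fn online_softmax(x: &[f32]) -> Vec<f32> {
    // Pass 1: keep a running maximum and a sum expressed relative to it.
    let mut max = f32::NEG_INFINITY;
    let mut sum = 0.0f32;
    for &v in x {
        if v > max {
            // New maximum: rescale the accumulated sum, then add exp(v - v) = 1.
            sum = sum * (max - v).exp() + 1.0;
            max = v;
        } else {
            sum += (v - max).exp();
        }
    }
    // Pass 2: write exp(x - global_max) / global_sum per element.
    x.iter().map(|&v| (v - max).exp() / sum).collect()
}
```

On the first element the old maximum is negative infinity, so the rescale factor is exp(-inf) = 0 and the sum starts at 1, which is why no special first-iteration case is needed.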

tile-rs Kernel Source

Softmax in ascend-rs uses the buffer API for element-wise backends and the tile API for matrix-oriented backends:

Scalar kernel (f32, benchmarked implementation):

#![allow(unused)]
fn main() {
#[tile_std::tile_kernel]
pub fn softmax(input: *const f32, output: *mut f32, len: *const u32) {
    let n = *len as usize;

    // Step 1: Find max for numerical stability
    let mut max_val = *input;
    let mut i = 1usize;
    loop {
        if i >= n { break; }
        let val = *input.wrapping_add(i);
        if val > max_val { max_val = val; }
        i += 1;
    }

    // Step 2: exp(x - max) and accumulate sum
    let mut sum: f32 = 0.0;
    i = 0;
    loop {
        if i >= n { break; }
        let exp_val = (*input.wrapping_add(i) - max_val).exp();
        *output.wrapping_add(i) = exp_val;
        sum += exp_val;
        i += 1;
    }

    // Step 3: Normalize
    i = 0;
    loop {
        if i >= n { break; }
        *output.wrapping_add(i) = *output.wrapping_add(i) / sum;
        i += 1;
    }
}
}

Tile API — safe entry form (lowered by rustc_codegen_tile to all 9 backends: Ascend AIV, CUDA, Apple Metal, Vulkan SPIR-V, AWS NKI, AMD AIE, Cambricon BANG, Intel Gaudi, Google TPU):

#![allow(unused)]
fn main() {
use tile_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[tile_std::tile_kernel]
pub fn tile_softmax(
    input:  GmView<'_, 1, 1024, f32>,
    output: GmViewMut<'_, 1, 1024, f32>,
) {
    let x = tile_load_view_f32(&input);
    let y = safe::tile_softmax_f32(x);
    tile_store_view_f32(&output, y);
}
}

The kernel body is pure safe Rust — no unsafe blocks. Shape (rows, cols, dtype) is committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. The #[tile_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 so the launcher/compiler toolchain (bisheng / ACL / nvcc) sees the same C ABI. #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level — the two forms emit as literal symbol aliases.

Kernels compile via rustc_codegen_tile → MLIR → target-specific code. Softmax is one of the four “hot path” tile ops (alongside matmul, rms-norm, silu) that are lowered on every backend currently targeted.

Benchmark configurations

Shape	Elements	Bytes (f32)	Notes
(1, 1024)	1K	4 KB	L1-resident, tests dispatch overhead
(64, 1024)	64K	256 KB	L2-resident, typical batch
(64, 4096)	256K	1 MB	Bandwidth-bound regime

Results

See Leaderboard filtered to Softmax for the full filterable view.

Causal Mask

Category: Masking | Complexity: O(S^2) elementwise | Memory: 1 pass (read+write)

Algorithm

Causal masking sets the upper triangle of the attention score matrix to negative infinity, preventing tokens from attending to future positions:

For i, j in [0..S) x [0..S):
  if j > i:  scores[i,j] = -inf
  else:      scores[i,j] = scores[i,j]

Applied between Q@K^T and softmax in autoregressive (decoder) attention:

scores = Q @ K^T / sqrt(d)
scores = causal_mask(scores)   <-- this kernel
weights = softmax(scores)

This is memory-bandwidth bound (simple conditional copy), but critical for correctness in all decoder-only models (GPT, LLaMA, etc.).
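A CPU reference for the mask is a few lines. The sketch below assumes a row-major (S × S) score matrix and masks in place. `causal_mask` is an illustrative name, not the tile-rs op.

```rust
/// Sets scores[i, j] = -inf for every j > i on a row-major (S × S) matrix.
/// Entries on and below the diagonal are left unchanged.
fn causal_mask(scores: &mut [f32], s: usize) {
    assert_eq!(scores.len(), s * s);
    for i in 0..s {
        for j in (i + 1)..s {
            scores[i * s + j] = f32::NEG_INFINITY;
        }
    }
}
```

Because exp(-inf) is 0, the masked positions receive zero weight in the softmax that follows.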

tile-rs Kernel Source

Causal mask using the tile API — safe entry form:

#![allow(unused)]
fn main() {
use tile_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[tile_std::tile_kernel]
pub fn tile_causal_mask(
    input:  GmView<'_, 64, 64, f32>,
    output: GmViewMut<'_, 64, 64, f32>,
) {
    let scores = tile_load_view_f32(&input);
    let masked = safe::tile_causal_mask_f32(scores);
    tile_store_view_f32(&output, masked);
}
}

The kernel body is pure safe Rust: shape (rows, cols, dtype) is committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. The square-shape requirement (rows == cols) is also enforced at the type level. The #[tile_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.

Backend status (lowered by rustc_codegen_tile): Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V (4/9). Ascend AIV / CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO — on those backends causal masking is currently applied as a buffer-API element-wise compare-and-select rather than a single fused tile op.

Benchmark configurations

Shape (S, S)	Elements	Bytes (f32)	Notes
(64, 64)	4K	16 KB	Small attention window
(128, 128)	16K	64 KB	Standard context
(256, 256)	65K	256 KB	Medium context
(512, 512)	262K	1 MB	Long context

Results

See Leaderboard filtered to Causal Mask for the full filterable view.

RoPE (Rotary Position Embedding)

Category: Positional Encoding | Complexity: O(S*D) elementwise | Memory: 2 passes (read+write, plus cos/sin tables)

Algorithm

RoPE (Su et al. 2021) encodes position by rotating pairs of dimensions at frequency-dependent rates:

For each pair (x[2i], x[2i+1]):
  theta = pos / 10000^(2i/d)
  x'[2i]   = x[2i]*cos(theta) - x[2i+1]*sin(theta)
  x'[2i+1] = x[2i]*sin(theta) + x[2i+1]*cos(theta)

Used in most modern LLMs (LLaMA, Mistral, GPT-NeoX, Qwen, etc.) to encode token position in Q/K vectors. RoPE is bandwidth-bound for short sequences and compute-bound (cos/sin) for long sequences.
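The rotation formula translates directly into a scalar reference. The sketch below applies it to one vector at one position, using the interleaved pairing (x[2i], x[2i+1]) from the formula and computing cos/sin on the fly rather than from tables. `rope_ref` is a hypothetical name.

```rust
/// Applies RoPE to one vector `x` of even length d at position `pos`.
fn rope_ref(x: &[f32], pos: usize) -> Vec<f32> {
    let d = x.len();
    assert!(d % 2 == 0, "RoPE rotates pairs, so d must be even");
    let mut out = vec![0.0f32; d];
    for i in 0..d / 2 {
        // theta = pos / 10000^(2i/d)
        let theta = pos as f32 / 10000f32.powf(2.0 * i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        out[2 * i] = x[2 * i] * cos - x[2 * i + 1] * sin;
        out[2 * i + 1] = x[2 * i] * sin + x[2 * i + 1] * cos;
    }
    out
}
```

Two properties make handy tests: position 0 is the identity, and each pair keeps its Euclidean norm because the transform is a pure rotation.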

tile-rs Kernel Source

RoPE using the tile API — safe entry form:

#![allow(unused)]
fn main() {
use tile_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[tile_std::tile_kernel]
pub fn tile_rope(
    input:  GmView<'_, 1, 128, f32>,
    output: GmViewMut<'_, 1, 128, f32>,
) {
    let x = tile_load_view_f32(&input);
    let y = safe::tile_rope_f32(x, 0);  // base position = 0
    tile_store_view_f32(&output, y);
}
}

The kernel body is pure safe Rust — shape (rows, cols, dtype) is committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. The #[tile_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.

Backend status (lowered by rustc_codegen_tile): Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V (4/9). Ascend AIV / CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO — on those backends RoPE is currently expressed as a buffer-API composition of element-wise cos/sin/mul/add rather than a single fused tile op.

Benchmark configurations

Shape (B, S, D)	Elements	Bytes (f32)	Notes
(1, 64, 128)	8K	32 KB	Single query, short context
(32, 64, 128)	262K	1 MB	Batched queries
(1, 128, 128)	16K	64 KB	Longer sequence

Results

See Leaderboard filtered to RoPE for the full filterable view.

LayerNorm

Category: Normalization | Complexity: O(N) per row | Memory: 3 passes

Algorithm

3-pass fused: mean, variance, normalize+affine in one workgroup:

Mean: Parallel sum reduction, divide by N
Variance: Parallel sum of (x - mean)^2, compute inverse std
Affine: gamma * (x - mean) * inv_std + beta

Uses SIMD group shuffles for warp-level reductions (1 threadgroup barrier instead of 8).
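The three steps, including the affine, can be checked against a scalar CPU reference such as the sketch below. It assumes one row per call and uses the biased variance (divide by N), matching the mean/variance description above. `layernorm_ref` is an illustrative name.

```rust
/// LayerNorm over one row: gamma * (x - mean) * inv_std + beta.
fn layernorm_ref(x: &[f32], gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    assert!(x.len() == gamma.len() && x.len() == beta.len());
    let n = x.len() as f32;
    // Mean: sum reduction divided by N.
    let mean = x.iter().sum::<f32>() / n;
    // Variance: sum of squared deviations divided by N, then inverse std.
    let var = x.iter().map(|&v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    // Affine: scale and shift each normalized element.
    x.iter()
        .zip(gamma)
        .zip(beta)
        .map(|((&v, &g), &b)| g * (v - mean) * inv_std + b)
        .collect()
}
```

With gamma set to ones and beta to zeros, the output has zero mean and (up to eps) unit variance.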

tile-rs Kernel Source

LayerNorm in ascend-rs using vectorized AscendC intrinsics (f32, benchmarked implementation):

#![allow(unused)]
fn main() {
#[tile_std::tile_kernel]
pub fn layernorm(input: *const f32, output: *mut f32, len_buf: *const u32) {
    let n = *len_buf;
    let eps = 1.0e-5f32;

    let in_buf = tile_std::__tile_buf_alloc(n);
    let out_buf = tile_std::__tile_buf_alloc(n);
    let work = tile_std::__tile_buf_alloc(n);
    let rwork = tile_std::__tile_buf_alloc(n);

    // DMA load: GM -> local buffer
    tile_std::__tile_buf_load_f32(in_buf, input, n);
    tile_std::__tile_pipe_barrier();

    // Step 1: mean = sum(x) / n
    let sum_val = tile_std::__tile_reduce_sum_f32(work, in_buf, rwork, n);
    let mean = sum_val / (n as f32);

    // Step 2: centered = x - mean
    tile_std::__tile_adds_f32(out_buf, in_buf, -mean, n);
    tile_std::__tile_pipe_barrier();

    // Step 3: var = sum((x - mean)^2) / n
    tile_std::__tile_mul_f32(work, out_buf, out_buf, n);
    tile_std::__tile_pipe_barrier();
    let var_sum = tile_std::__tile_reduce_sum_f32(work, work, rwork, n);
    let inv_std = 1.0 / (var_sum / (n as f32) + eps).sqrt();

    // Step 4: output = centered * inv_std
    tile_std::__tile_muls_f32(out_buf, out_buf, inv_std, n);

    tile_std::__tile_pipe_barrier();
    tile_std::__tile_buf_store_f32(output, out_buf, n);
}
}

This buffer-API kernel is the primary implementation and runs on the Ascend AIV backend. As shown, it stops at the normalization step and does not apply the gamma/beta affine from the algorithm above. A tile-API safe::tile_layernorm_f32 variant is additionally lowered by rustc_codegen_tile to Apple Metal (1/9); the other 8 backend lowerings (Ascend AIV / CUDA / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. On non-Metal backends, LayerNorm is currently composed at the buffer API as shown above (mean → sub → mul² → mean → sqrt → mul) rather than emitted as a single tile op.

Benchmark configurations

Shape	Notes
(1, 768)	GPT-2 hidden dim, single position
(64, 768)	Typical batch
(1024, 768)	Large batch

Results

See Leaderboard filtered to LayerNorm for the full filterable view.

RMS Norm

Category: Normalization | Complexity: O(N) per row | Memory: 2 passes

Algorithm

RMSNorm (Zhang & Sennrich 2019) is a simplified LayerNorm used in LLaMA, Gemma, and most modern LLMs. It omits the mean-centering step:

RMS: Compute root-mean-square: rms = sqrt(mean(x²) + ε)
Normalize + Scale: y = (x / rms) * gamma

Compared to LayerNorm, RMSNorm saves one reduction pass (no mean computation) and one elementwise subtraction, yielding ~15% faster inference at equal accuracy.
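The two steps reduce to a short scalar reference, shown here as a sketch for one row. `rms_norm_ref` is a hypothetical name used for validation only.

```rust
/// RMSNorm over one row: y[i] = x[i] / sqrt(mean(x²) + eps) * gamma[i].
fn rms_norm_ref(x: &[f32], gamma: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), gamma.len());
    // Single reduction: mean of squares (no mean-centering step).
    let mean_sq = x.iter().map(|&v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    // Normalize and scale in one element-wise pass.
    x.iter().zip(gamma).map(|(&v, &g)| v * inv_rms * g).collect()
}
```

For x = [3, 4] with unit gamma and eps = 0, mean(x²) is 12.5, so the output is approximately [0.8485, 1.1314].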

tile-rs Kernel Source

RMS Norm using tile-rs buffer API (f32):

#![allow(unused)]
fn main() {
/// RMS Norm: y[i] = (x[i] / rms) * gamma[i]
/// where rms = sqrt(mean(x²) + eps)
///
/// params: [n: u32]
#[tile_std::tile_kernel]
pub fn rms_norm(
    input: *const f32,
    gamma: *const f32,
    output: *mut f32,
    params: *const u32,
) {
    let n = *params;
    let eps = 1.0e-5f32;

    let in_buf = tile_std::__tile_buf_alloc(n);
    let gamma_buf = tile_std::__tile_buf_alloc(n);
    let work = tile_std::__tile_buf_alloc(n);
    let rwork = tile_std::__tile_buf_alloc(n);

    // Load input and gamma
    tile_std::__tile_buf_load_f32(in_buf, input, n);
    tile_std::__tile_buf_load_f32(gamma_buf, gamma, n);
    tile_std::__tile_pipe_barrier();

    // Step 1: x² → work
    tile_std::__tile_mul_f32(work, in_buf, in_buf, n);
    tile_std::__tile_pipe_barrier();

    // Step 2: rms = sqrt(mean(x²) + eps)
    let sq_sum = tile_std::__tile_reduce_sum_f32(work, work, rwork, n);
    let inv_rms = 1.0 / (sq_sum / (n as f32) + eps).sqrt();

    // Step 3: output = (x * inv_rms) * gamma
    tile_std::__tile_muls_f32(work, in_buf, inv_rms, n);
    tile_std::__tile_pipe_barrier();
    tile_std::__tile_mul_f32(work, work, gamma_buf, n);

    tile_std::__tile_pipe_barrier();
    tile_std::__tile_buf_store_f32(output, work, n);
}
}

This buffer-API kernel runs on the Ascend AIV backend. A tile-API safe::tile_rms_norm_f32 variant is additionally lowered by rustc_codegen_tile to all 9 backends (Ascend AIV, CUDA, Apple Metal, Vulkan SPIR-V, AWS NKI, AMD AIE, Cambricon BANG, Intel Gaudi, Google TPU). RMS Norm is one of the four “hot path” tile ops (alongside matmul, softmax, silu) that are lowered on every backend currently targeted.

Benchmark configurations

Shape	Elements	Bytes (f32)	Notes
(1, 768)	768	3 KB	GPT-2 hidden dim, single token
(1, 4096)	4K	16 KB	LLaMA-7B hidden dim
(64, 768)	49K	192 KB	Typical batch, GPT-2
(64, 4096)	262K	1 MB	Typical batch, LLaMA
(1024, 4096)	4.2M	16 MB	Large batch, bandwidth-bound

All benchmarks use f32.

Results

See Leaderboard filtered to RMS Norm for the full filterable view.

GELU

Category: Activation | Complexity: O(N) elementwise | Memory: 1 pass (fused read+write)

Algorithm

GELU (Gaussian Error Linear Unit, Hendrycks & Gimpel 2016) is the standard activation in BERT, GPT, and many other transformer models (LLaMA-style models use SiLU instead):

GELU(x) = x · Φ(x) = x · 0.5 · (1 + erf(x / √2))

The fast tanh approximation (used in PyTorch gelu(approximate='tanh')):

GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))

GELU is memory-bandwidth bound — the compute-to-byte ratio is low (a few FLOPs per 4-byte element), so peak throughput is measured in GB/s rather than TFLOPS.

tile-rs Kernel Source

GELU using tile-rs buffer API (f32, tanh approximation):

#![allow(unused)]
fn main() {
/// GELU activation: y = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
///
/// params: [n: u32]
#[tile_std::tile_kernel]
pub fn gelu(
    input: *const f32,
    output: *mut f32,
    params: *const u32,
) {
    let n = *params as usize;
    let sqrt_2_pi: f32 = 0.7978845608; // sqrt(2/pi)
    let coeff: f32 = 0.044715;

    let mut i: usize = 0;
    while i < n {
        let x = *input.wrapping_add(i);
        let x3 = x * x * x;
        let inner = sqrt_2_pi * (x + coeff * x3);
        // tanh via exp: tanh(z) = 1 - 2/(e^2z + 1); unlike (e^2z - 1)/(e^2z + 1),
        // this stays finite when e^2z overflows to +inf for large x
        let e2z = (2.0 * inner).exp();
        let tanh_val = 1.0 - 2.0 / (e2z + 1.0);
        *output.wrapping_add(i) = 0.5 * x * (1.0 + tanh_val);
        i += 1;
    }
}
}

Vectorized version using buffer intrinsics:

#![allow(unused)]
fn main() {
#[tile_std::tile_kernel]
pub fn gelu_vec(
    input: *const f32,
    output: *mut f32,
    params: *const u32,
) {
    let n = *params;
    let in_buf = tile_std::__tile_buf_alloc(n);
    let work = tile_std::__tile_buf_alloc(n);
    let work2 = tile_std::__tile_buf_alloc(n);

    tile_std::__tile_buf_load_f32(in_buf, input, n);
    tile_std::__tile_pipe_barrier();

    // x³
    tile_std::__tile_mul_f32(work, in_buf, in_buf, n);
    tile_std::__tile_pipe_barrier();
    tile_std::__tile_mul_f32(work, work, in_buf, n);
    tile_std::__tile_pipe_barrier();
    // 0.044715 * x³
    tile_std::__tile_muls_f32(work, work, 0.044715, n);
    tile_std::__tile_pipe_barrier();
    // x + 0.044715 * x³
    tile_std::__tile_add_f32(work, in_buf, work, n);
    tile_std::__tile_pipe_barrier();
    // sqrt(2/pi) * (x + 0.044715 * x³)
    tile_std::__tile_muls_f32(work, work, 0.7978845608, n);
    tile_std::__tile_pipe_barrier();

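    // Store the tanh argument computed so far; the remaining GELU steps
    // (tanh, 1 + tanh, 0.5 * x * ...) are not shown in this excerpt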
    tile_std::__tile_buf_store_f32(output, work, n);
}
}

These buffer-API kernels run on the Ascend AIV backend via rustc_codegen_tile. No tile-API safe::tile_gelu_f32 currently exists — tile-API lowerings on all 9 backends (Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. Cross-backend execution today goes through the buffer-API scalar loop or the element-wise intrinsic composition shown above.

Benchmark configurations

Shape	Elements	Bytes (f32)	Notes
(1, 768)	768	3 KB	GPT-2 hidden dim
(1, 4096)	4K	16 KB	LLaMA hidden dim
(64, 768)	49K	192 KB	Typical batch
(64, 4096)	262K	1 MB	Bandwidth-bound
(1024, 4096)	4.2M	16 MB	Large batch

All benchmarks use f32.

Results

See Leaderboard filtered to GELU for the full filterable view.

SiLU / Swish

Category: Activation | Complexity: O(N) elementwise | Memory: 1 pass (fused read+write)

Algorithm

SiLU (Sigmoid Linear Unit), also known as Swish (Ramachandran et al. 2017), is the gate activation in LLaMA, Mistral, and most modern LLMs:

SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x))

Used in the FFN block:

FFN(x) = W_down * (SiLU(W_gate * x) * (W_up * x))

Like GELU, SiLU is memory-bandwidth bound. The compute-to-byte ratio is low (a few FLOPs per element), so throughput is measured in GB/s.
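A scalar reference for the activation and the gated element-wise product is sketched below. The projections themselves (W_gate, W_up) are GEMMs and are assumed to have been applied already. `silu` and `silu_gate` are illustrative names.

```rust
/// SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise gate: SiLU(gate[i]) * up[i], where `gate` and `up` are the
/// already-projected vectors W_gate * x and W_up * x.
fn silu_gate(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len());
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}
```

This form is also well behaved at the extremes in f32: for very negative x the exponential overflows to infinity and the quotient goes to zero, while for very positive x the result approaches x.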

tile-rs Kernel Source

SiLU using the tile API — safe entry form (lowered by rustc_codegen_tile to all 9 backends: Ascend AIV, CUDA, Apple Metal, Vulkan SPIR-V, AWS NKI, AMD AIE, Cambricon BANG, Intel Gaudi, Google TPU):

#![allow(unused)]
fn main() {
use tile_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[tile_std::tile_kernel]
pub fn tile_silu(
    input:  GmView<'_, 1, 4096, f32>,
    output: GmViewMut<'_, 1, 4096, f32>,
) {
    let x = tile_load_view_f32(&input);
    let y = safe::tile_silu_f32(x);
    tile_store_view_f32(&output, y);
}
}

safe::tile_silu_f32 decomposes to: neg → exp → add_scalar(1) → reciprocal → mul with original x. SiLU is one of the four “hot path” tile ops (alongside matmul, softmax, rms-norm) that are lowered on every backend currently targeted.

Benchmark configurations

Shape	Elements	Bytes (f32)	Notes
(1, 768)	768	3 KB	GPT-2 hidden dim
(1, 4096)	4K	16 KB	LLaMA hidden dim
(64, 4096)	262K	1 MB	Typical batch
(1024, 4096)	4.2M	16 MB	Large batch

Results

See Leaderboard filtered to SiLU for the full filterable view.

Embedding Lookup

Category: Memory Access | Complexity: O(N*D) gather | Memory: Random access (bandwidth-bound)

Algorithm

Embedding lookup gathers rows from a (V, D) weight table by token indices:

For each token index t[i] in [0..V):
  output[i, :] = weight[t[i], :]

This is the first operation in any transformer: tokens (integers) become vectors. It is purely bandwidth-bound with random access patterns, making it a key memory subsystem benchmark.
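The gather itself is a row copy per token. A minimal CPU reference, assuming a row-major (V, D) table and bounds-checked indices, with `embedding_ref` as a hypothetical name:

```rust
/// Gathers rows of a row-major (V × D) table: output[i, :] = weight[t[i], :].
fn embedding_ref(weight: &[f32], v: usize, d: usize, indices: &[u32]) -> Vec<f32> {
    assert_eq!(weight.len(), v * d);
    let mut out = Vec::with_capacity(indices.len() * d);
    for &t in indices {
        let t = t as usize;
        assert!(t < v, "token index out of range");
        // Copy one contiguous row of D elements.
        out.extend_from_slice(&weight[t * d..(t + 1) * d]);
    }
    out
}
```

Each row copy is contiguous, but consecutive tokens can land anywhere in the table, which is the random-access pattern described above.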

tile-rs Kernel Source

Embedding using the tile API — safe entry form with one unsafe block (the indices pointer is an integer gather source, not a tile, so safe::tile_embedding_f32 is declared pub unsafe fn):

use tile_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[tile_std::tile_kernel]
pub fn tile_embedding(
    weight:  GmView<'_, 32000, 128, f32>,  // (V, D) codebook
    indices: *const u32,                   // (N,) token ids — integer gather source
    output:  GmViewMut<'_, 32, 128, f32>,  // (N, D) gathered rows
) {
    let w = tile_load_view_f32(&weight);
    // SAFETY: `indices` is a valid *const u32 of length COUNT=32, guaranteed by
    // the launcher. The unsafe wrapper is the only non-safe surface.
    let emb = unsafe { safe::tile_embedding_f32(w, indices) };
    tile_store_view_f32(&output, emb);
}

Weight table and output shapes are committed at the type level via const generics (V, D, N), so any host-side mismatch becomes a compile-time error. The #[tile_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 for the tile params so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.
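To illustrate how shape-typed views can catch mismatches at compile time while staying pointer-sized, here is a minimal sketch. View and embedding_shapes are hypothetical stand-ins written for this example; the real GmView/GmViewMut definitions in tile_std may differ.

```rust
use core::marker::PhantomData;

/// Hypothetical stand-in for a shape-typed global-memory view.
/// repr(transparent) gives it the same layout as its single pointer field.
#[repr(transparent)]
pub struct View<'a, const R: usize, const C: usize, T> {
    ptr: *const T,
    _life: PhantomData<&'a T>,
}

impl<'a, const R: usize, const C: usize, T> View<'a, R, C, T> {
    /// Borrow a fixed-size row-major array as an (R, C) view.
    pub fn new(data: &'a [[T; C]; R]) -> Self {
        View { ptr: data.as_ptr() as *const T, _life: PhantomData }
    }
    pub const fn rows(&self) -> usize { R }
    pub const fn cols(&self) -> usize { C }
    pub fn as_ptr(&self) -> *const T { self.ptr }
}

/// Weight is (V, D) and output is (N, D). D appears in both types, so
/// passing views with different D is rejected by the type checker.
pub fn embedding_shapes<const V: usize, const D: usize, const N: usize>(
    _weight: &View<'_, V, D, f32>,
    _output: &View<'_, N, D, f32>,
) -> (usize, usize, usize) {
    (V, D, N)
}
```

With this sketch, calling embedding_shapes with a (5, 3) table and a (2, 4) output does not compile, because the two D parameters cannot be unified.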

Backend status (lowered by rustc_codegen_tile): Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V. Ascend AIV / CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO.

Benchmark configurations

Shape (N, V, D)	Output Elements	Bytes (f32)	Notes
(32, 32000, 128)	4K	16 KB	LLaMA-2 vocab, small dim
(128, 32000, 128)	16K	64 KB	Larger batch
(32, 32000, 4096)	131K	512 KB	Full hidden dim

Results

See Leaderboard filtered to Embedding for the full filterable view.

Cross-Entropy Loss

Category: Loss Function | Complexity: O(N*V) reduction | Memory: 2 passes (max + sum-exp)

Algorithm

Cross-entropy loss is the standard training objective for classification and language modeling:

loss[i] = -logits[i, target[i]] + log(sum(exp(logits[i, :])))

Numerically stable version (log-sum-exp trick):

m = max(logits[i, :])
loss[i] = -(logits[i, target[i]] - m) + log(sum(exp(logits[i, :] - m)))

This kernel is compute-heavy for large vocabularies (V=32000+) due to the row-wise exp and reduction. It combines softmax-like reduction with an index gather.
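The numerically stable form can be sketched as a CPU reference with the two passes made explicit. The name cross_entropy_ref and the row-major (N, V) layout are assumptions for illustration, not part of tile_std.

```rust
/// Per-row cross-entropy over row-major (N, V) logits, using the
/// max-subtraction (log-sum-exp) form. Hypothetical CPU reference.
pub fn cross_entropy_ref(logits: &[f32], v: usize, targets: &[u32]) -> Vec<f32> {
    assert_eq!(logits.len(), targets.len() * v);
    targets
        .iter()
        .enumerate()
        .map(|(i, &t)| {
            let row = &logits[i * v..(i + 1) * v];
            // Pass 1: row maximum.
            let m = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
            // Pass 2: sum of shifted exponentials (every term is <= 1).
            let sum_exp: f32 = row.iter().map(|&x| (x - m).exp()).sum();
            // Index gather of the target logit, then combine.
            -(row[t as usize] - m) + sum_exp.ln()
        })
        .collect()
}
```

For a row of identical logits the loss is ln(V) regardless of their magnitude, which makes a convenient check that the max subtraction is preventing overflow.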

tile-rs Kernel Source

Cross-entropy using the tile API — safe entry form with one unsafe block (the targets pointer is an integer gather source, not a tile, so safe::tile_cross_entropy_f32 is declared pub unsafe fn):

use tile_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[tile_std::tile_kernel]
pub fn tile_cross_entropy(
    logits:  GmView<'_, 32, 32000, f32>,  // (N, V)
    targets: *const u32,                  // (N,) target class ids — integer gather
    loss:    GmViewMut<'_, 32, 1, f32>,   // (N, 1) per-row loss
) {
    let x = tile_load_view_f32(&logits);
    // SAFETY: `targets` is a valid *const u32 of length R=32, guaranteed by
    // the launcher. The unsafe wrapper is the only non-safe surface.
    let y = unsafe { safe::tile_cross_entropy_f32(x, targets) };
    tile_store_view_f32(&loss, y);
}

Logits and loss shapes (and their shared N) are committed at the type level via const generics, so any host-side mismatch becomes a compile-time error. The #[tile_kernel] attribute rewrites the emitted signature back to raw *const f32 / *mut f32 for the tile params so the launcher toolchain sees the same C ABI; #[repr(transparent)] on GmView/GmViewMut makes this rewrite free at the LLVM IR level.

Backend status (lowered by rustc_codegen_tile): Cambricon BANG, Intel Gaudi. Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Google TPU lowerings are TODO. This is the narrowest backend coverage of any tile-API kernel page, reflecting that cross-entropy is primarily a training-loss primitive.

Benchmark configurations

Shape (N, V)	Elements	Bytes (f32)	Notes
(32, 32000)	1M	4 MB	LLaMA-2 vocab, small batch
(128, 32000)	4M	16 MB	Larger batch
(32, 50257)	1.6M	6.4 MB	GPT-2 vocab

Results

See Leaderboard filtered to Cross-Entropy for the full filterable view.

Dilated Conv1D + ReLU

Category: Convolution | Complexity: O(B·T·C²·3) | Fusion: pad + gather + matmul + ReLU

Algorithm

A dilated 1D convolution with kernel size 3, fused with bias and ReLU activation. Used in VQ-VAE encoder/decoder ResConv1DBlocks (e.g., SOKE, Jukebox-style models).

The naive implementation requires:

Pad the input by dilation on each side (zero-padding)
Gather 3 positions per output: [t-d, t, t+d]
Concat along the channel axis -> (B, T, 3C)
Linear projection (3C -> C)
ReLU

This kernel fuses all 5 steps into a single device pass, eliminating three intermediate (B, T, 3C) buffer allocations and the data shuffles between them.
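The five steps above can be collapsed into one loop nest. The following CPU sketch handles a single batch item; the function name, the row-major (T, C) input, and the (3C, C) weight layout with tap order [t-d, t, t+d] are assumptions chosen for illustration.

```rust
/// Fused dilated conv1d (kernel size 3) + bias + ReLU for one batch item.
/// x: (T, C) row-major; w: (3C, C) row-major with tap order [t-d, t, t+d];
/// bias: (C,). Hypothetical CPU reference.
pub fn conv1d_dilated_relu_ref(
    x: &[f32], t_len: usize, c: usize, dilation: usize, w: &[f32], bias: &[f32],
) -> Vec<f32> {
    assert_eq!(x.len(), t_len * c);
    assert_eq!(w.len(), 3 * c * c);
    assert_eq!(bias.len(), c);
    let mut out = vec![0.0f32; t_len * c];
    for t in 0..t_len {
        // Source positions for the three taps; None stands for zero padding.
        let taps = [
            t.checked_sub(dilation),
            Some(t),
            t.checked_add(dilation).filter(|&p| p < t_len),
        ];
        for co in 0..c {
            let mut acc = bias[co];
            for (k, tap) in taps.iter().enumerate() {
                if let Some(p) = *tap {
                    // Row k*C + ci of w is input channel ci of tap k.
                    for ci in 0..c {
                        acc += x[p * c + ci] * w[(k * c + ci) * c + co];
                    }
                }
            }
            out[t * c + co] = acc.max(0.0); // ReLU
        }
    }
    out
}
```

No padded, gathered, or concatenated tensor is ever built: padding becomes a skipped tap, and the concat is just an offset into the weight matrix.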

Why fusion matters

For an 18-block VQ-VAE encoder/decoder, the unfused version allocates 54 intermediate tensors per forward pass (3 per block × 18 blocks) and reads each one back. Fusing into one kernel:

Eliminates intermediate buffer writes/reads (3x memory bandwidth reduction)
Keeps activations in registers/L1 cache between stages
Issues one command buffer dispatch instead of five

tile-rs Kernel Source

Vectorized dilated conv1d + ReLU using tile-rs buffer API (f32, benchmarked implementation):

#[tile_std::tile_kernel]
pub fn conv1d_dilated(input: *const f32, output: *mut f32, params: *const u32) {
    let n = *params;
    let dilation = *params.wrapping_add(1);
    let w0 = f32::from_bits(*params.wrapping_add(2));
    let w1 = f32::from_bits(*params.wrapping_add(3));
    let w2 = f32::from_bits(*params.wrapping_add(4));
    let bias = f32::from_bits(*params.wrapping_add(5));

    let aligned_n = ((n + 7) / 8) * 8;
    let in_buf = tile_std::__tile_buf_alloc(aligned_n);
    let tap_left = tile_std::__tile_buf_alloc(aligned_n);
    let tap_right = tile_std::__tile_buf_alloc(aligned_n);
    let acc = tile_std::__tile_buf_alloc(aligned_n);
    let work = tile_std::__tile_buf_alloc(aligned_n);

    tile_std::__tile_buf_load_f32(in_buf, input, n);
    tile_std::__tile_pipe_barrier();

    // Build shifted tap buffers with zero-padding
    tile_std::__tile_buf_fill_f32(tap_left, 0.0, aligned_n);
    let mut i = dilation;
    while i < n {
        let v = tile_std::__tile_get_value_f32(in_buf, i - dilation);
        tile_std::__tile_set_value_f32(tap_left, i, v);
        i += 1;
    }
    tile_std::__tile_buf_fill_f32(tap_right, 0.0, aligned_n);
    i = 0;
    while i + dilation < n {
        let v = tile_std::__tile_get_value_f32(in_buf, i + dilation);
        tile_std::__tile_set_value_f32(tap_right, i, v);
        i += 1;
    }

    // Vector MAC: acc = tap_left*w0 + input*w1 + tap_right*w2 + bias
    tile_std::__tile_muls_f32(acc, tap_left, w0, n);
    tile_std::__tile_muls_f32(work, in_buf, w1, n);
    tile_std::__tile_add_f32(tap_left, acc, work, n);
    tile_std::__tile_muls_f32(work, tap_right, w2, n);
    tile_std::__tile_add_f32(acc, tap_left, work, n);
    tile_std::__tile_adds_f32(acc, acc, bias, n);
    tile_std::__tile_maxs_f32(acc, acc, 0.0, n);  // ReLU

    tile_std::__tile_pipe_barrier();
    tile_std::__tile_buf_store_f32(output, acc, n);
}

This buffer-API kernel runs on the Ascend AIV backend via rustc_codegen_tile. No tile-API safe::tile_conv1d_f32 currently exists — tile-API lowerings on all 9 backends (Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. On non-Ascend backends the fused pad+gather+matmul+ReLU is currently expressed as a buffer-API composition rather than a single tile op.

Benchmark configurations

Shape (B, T, C)	Elements	Notes
(2, 50, 512)	51 K	Single VQ-VAE block, small batch
(8, 100, 512)	410 K	Mid-sized clip
(2, 400, 512)	410 K	Long sequence

Results

See Leaderboard filtered to conv1d-dilated for the full filterable view.

VQ Quantize + EMA Update

Category: Quantization | Complexity: O(N·K·D) | Fusion: L2 distance + argmin + scatter-add

Algorithm

Vector quantization maps each input vector to its nearest codebook entry, then updates the codebook via exponential moving average (EMA). Used in VQ-VAE training (SOKE, Jukebox, SoundStream).

Pipeline:

L2 distance: For each input vector x[i] (dim D), compute ||x[i] - c[k]||² against all K codebook entries
Argmin: Find nearest codebook entry k* = argmin_k ||x[i] - c[k]||²
Quantize: Output q[i] = c[k*] (the nearest codebook vector)
EMA scatter-add: Accumulate x[i] into codebook slot k* for EMA update: sum[k*] += x[i], count[k*] += 1

Fusing all 4 steps into one kernel eliminates 3 intermediate buffers (distance matrix, index array, scatter workspace).
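The four-step pipeline can be sketched as a single CPU loop that never stores the distance matrix or the index array. The function name and the slice-based signature are assumptions for illustration; this is a host-side reference, not the device kernel.

```rust
/// Nearest-codebook quantization with EMA accumulation in one loop.
/// input: (N, D), codebook: (K, D), both row-major. Returns the (N, D) output
/// and updates cb_sum (K, D) and cb_count (K,) in place.
/// Hypothetical CPU reference.
pub fn vq_quantize_ref(
    input: &[f32], codebook: &[f32], d: usize,
    cb_sum: &mut [f32], cb_count: &mut [u32],
) -> Vec<f32> {
    let k = codebook.len() / d;
    assert_eq!(cb_sum.len(), k * d);
    assert_eq!(cb_count.len(), k);
    let mut out = Vec::with_capacity(input.len());
    for x in input.chunks_exact(d) {
        // Steps 1-2: squared L2 distance to every code, keeping the first minimum.
        let mut best = (0usize, f32::MAX);
        for (j, c) in codebook.chunks_exact(d).enumerate() {
            let dist: f32 = x.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum();
            if dist < best.1 {
                best = (j, dist);
            }
        }
        let row = best.0 * d..(best.0 + 1) * d;
        // Step 3: emit the winning code.
        out.extend_from_slice(&codebook[row.clone()]);
        // Step 4: scatter-add into the EMA accumulators.
        for (s, xi) in cb_sum[row].iter_mut().zip(x) {
            *s += xi;
        }
        cb_count[best.0] += 1;
    }
    out
}
```

Only one distance and one index are live at a time, so the working set is O(D) per input vector rather than O(N·K).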

tile-rs Kernel Source

VQ quantize kernel using tile-rs buffer API (f32):

/// VQ Quantize: for each input vector, find nearest codebook entry (L2),
/// output the quantized vector, and scatter-add for EMA codebook update.
///
/// params: [n_vectors: u32, n_codes: u32, dim: u32]
#[tile_std::tile_kernel]
pub fn vq_quantize(
    input: *const f32,      // (N, D) input vectors
    codebook: *const f32,   // (K, D) codebook
    output: *mut f32,       // (N, D) quantized output
    cb_sum: *mut f32,       // (K, D) EMA numerator accumulator
    cb_count: *mut u32,     // (K,)   EMA denominator counter
    params: *const u32,
) {
    let n = *params;                      // number of input vectors
    let k = *params.wrapping_add(1);      // codebook size
    let d = *params.wrapping_add(2);      // vector dimension

    let buf_x = tile_std::__tile_buf_alloc(d);     // current input vector
    let buf_c = tile_std::__tile_buf_alloc(d);     // current codebook entry
    let buf_diff = tile_std::__tile_buf_alloc(d);  // x - c
    let buf_work = tile_std::__tile_buf_alloc(d);
    let buf_rwork = tile_std::__tile_buf_alloc(d);

    let mut i: u32 = 0;
    while i < n {
        // Load input vector x[i]
        let x_ptr = input.wrapping_add((i * d) as usize);
        tile_std::__tile_buf_load_f32(buf_x, x_ptr, d);
        tile_std::__tile_pipe_barrier();

        // Find nearest codebook entry (L2 argmin)
        let mut best_k: u32 = 0;
        let mut best_dist: f32 = f32::MAX;

        let mut j: u32 = 0;
        while j < k {
            let c_ptr = codebook.wrapping_add((j * d) as usize);
            tile_std::__tile_buf_load_f32(buf_c, c_ptr, d);
            tile_std::__tile_pipe_barrier();

            // diff = x - c
            tile_std::__tile_sub_f32(buf_diff, buf_x, buf_c, d);
            tile_std::__tile_pipe_barrier();
            // diff² = diff * diff
            tile_std::__tile_mul_f32(buf_diff, buf_diff, buf_diff, d);
            tile_std::__tile_pipe_barrier();
            // dist = sum(diff²)
            let dist = tile_std::__tile_reduce_sum_f32(
                buf_work, buf_diff, buf_rwork, d);

            if dist < best_dist {
                best_dist = dist;
                best_k = j;
            }
            j += 1;
        }

        // Output: quantized = codebook[best_k]
        let best_ptr = codebook.wrapping_add((best_k * d) as usize);
        tile_std::__tile_buf_load_f32(buf_c, best_ptr, d);
        tile_std::__tile_pipe_barrier();
        let out_ptr = output.wrapping_add((i * d) as usize);
        tile_std::__tile_buf_store_f32(out_ptr, buf_c, d);

        // EMA scatter-add: cb_sum[best_k] += x[i], cb_count[best_k] += 1
        let sum_ptr = cb_sum.wrapping_add((best_k * d) as usize);
        let sum_buf = tile_std::__tile_buf_alloc(d);
        tile_std::__tile_buf_load_f32(sum_buf, sum_ptr, d);
        tile_std::__tile_pipe_barrier();
        tile_std::__tile_add_f32(sum_buf, sum_buf, buf_x, d);
        tile_std::__tile_pipe_barrier();
        tile_std::__tile_buf_store_f32(sum_ptr, sum_buf, d);

        let count_val = *cb_count.wrapping_add(best_k as usize);
        *cb_count.wrapping_add(best_k as usize) = count_val + 1;

        i += 1;
    }
}

This buffer-API kernel runs on the Ascend AIV backend via rustc_codegen_tile, and avoids materializing the N×K distance matrix and the N-element index array. No tile-API safe::tile_vq_quantize_f32 currently exists; tile-API lowerings on all 9 backends (Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. Cross-backend VQ today uses vendor kernels (aclnnMatmul, MPS GEMM, torch.cdist) with a separate argmin pass rather than the fused Rust kernel shown above, as the Notes column of the results table indicates.
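The vendor paths are GEMM-based. This page does not spell out how, but a common way to turn nearest-code search into a matmul is the expansion ||x − c||² = ||x||² − 2 x·c + ||c||², and the sketch below assumes that is what is meant. Since ||x||² is constant per input row, the argmin only needs ||c||² − 2 x·c, and the x·c terms form an (N, D) × (D, K) product. The function name is hypothetical and the loop is a CPU stand-in for the GEMM.

```rust
/// Nearest-code indices via the expanded distance
/// ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2.
/// ||x||^2 is dropped because it does not change the per-row argmin.
/// Hypothetical CPU sketch; a vendor path would compute the dot products
/// with one (N, D) x (D, K) matmul.
pub fn argmin_via_matmul(input: &[f32], codebook: &[f32], d: usize) -> Vec<usize> {
    // Precompute ||c_k||^2 once per code.
    let c_norm: Vec<f32> = codebook
        .chunks_exact(d)
        .map(|c| c.iter().map(|v| v * v).sum::<f32>())
        .collect();
    input
        .chunks_exact(d)
        .map(|x| {
            let mut best = (0usize, f32::MAX);
            for (j, c) in codebook.chunks_exact(d).enumerate() {
                // One entry of the X * C^T product.
                let dot: f32 = x.iter().zip(c).map(|(a, b)| a * b).sum();
                let score = c_norm[j] - 2.0 * dot;
                if score < best.1 {
                    best = (j, score);
                }
            }
            best.0
        })
        .collect()
}
```

The trade-off is that this form materializes (or streams) an N×K score matrix and needs a separate argmin, which is exactly what the fused kernel avoids, in exchange for running the bulk of the arithmetic on a matrix engine.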

Benchmark configurations

Shape (N, K, D)	FLOPs	Notes
(256, 512, 64)	16.8 M	Small codebook, low-latency inference
(1024, 512, 64)	67.1 M	Typical VQ-VAE batch
(1024, 1024, 128)	268 M	Large codebook, high-dim embeddings
(4096, 512, 64)	268 M	Large batch training

All benchmarks use f32.
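The FLOPs column is consistent with counting 2·N·K·D operations per call. That formula is inferred from the listed values rather than stated on this page, so treat the sketch below as an assumption. It also shows the unit conversion from a latency in microseconds to GFLOPS.

```rust
/// FLOP count consistent with the table above: 2 * N * K * D.
/// Inferred from the listed values, not from a stated formula.
pub fn vq_flops(n: u64, k: u64, d: u64) -> u64 {
    2 * n * k * d
}

/// Convert a FLOP count and a latency in microseconds to GFLOPS.
/// 1 GFLOPS = 1e9 FLOP/s = 1e3 FLOP/us.
pub fn gflops(flops: u64, latency_us: f64) -> f64 {
    flops as f64 / (latency_us * 1e3)
}
```

For example, gflops(vq_flops(4096, 512, 64), 43.0) comes to roughly 6,243, in line with the Ascend 910B row for that shape in the results table.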

Results

Device	Shape (N, K, D)	Latency (μs)	GFLOPS	Notes
Ascend 910B	(4096, 1024, 128)	94	11,411	aclnnMatmul L2 trick
Ascend 910B	(1024, 1024, 128)	31	8,604	Large codebook
Ascend 910B	(4096, 512, 64)	43	6,243	Large batch
Apple M2 Max	(4096, 1024, 128)	646	1,662	MPS GEMM + CPU argmin
Apple M2 Max	(8192, 512, 64)	450	1,193	Large batch
Tesla T4	(4096, 1024, 128)	1,163	923	torch.cdist + argmin
Tesla T4	(1024, 1024, 128)	494	544	torch.cdist + argmin
Tesla T4	(4096, 512, 64)	624	430	torch.cdist + argmin

Peak measured throughput, all at shape (4096, 1024, 128): 11.4 TFLOPS on Ascend 910B (cube engine via L2 distance matmul trick), 1.7 TFLOPS on Apple M2 Max via MPS, and 0.92 TFLOPS (923 GFLOPS) on Tesla T4 via torch.cdist.

See Leaderboard filtered to VQ Quantize for the full filterable view.

Financial Sidecar

Market context for xPU investment and procurement decisions.

Stock prices (AI chip vendors)

Ticker	Company	Role
NVDA	NVIDIA	GPU market leader
AMD	AMD	MI300X, CDNA competitor
AAPL	Apple	M-series, Metal ecosystem
INTC	Intel	Gaudi, Habana
GOOG	Google	TPU, custom silicon
AMZN	Amazon	Trainium, Inferentia

Device street prices

Tracking real-world prices (not MSRP) helps compute true cost-effectiveness:

Device	MSRP	Street Price	Source
NVIDIA H100 SXM	$30,000	Check latest	eBay, broker
NVIDIA A100 80GB	$10,000	Check latest	eBay, broker
AMD MI300X	$15,000	Check latest	AMD direct
Apple M4 Max (laptop)	$3,999	Check latest	Apple Store

Commodity reference

Symbol	Relevance
Gold (XAU)	Store-of-value benchmark
Oil (WTI)	Energy cost proxy
BTC	Crypto mining demand affects GPU pricing
USD/CNY	Huawei/Cambricon pricing

Price data updated weekly via scripts/fetch_prices.py.

Submit Results

CSV format

Create a CSV file named <device-slug>.csv with these columns:

device_id,kernel_id,dtype,input_shape,batch_size,impl_lang,latency_us,driver_version,toolchain,git_sha,submitter
nvidia-h100-sxm,softmax,f32,"[64, 1024]",1,cuda,12.3,CUDA 12.4,nvcc 12.4,abc1234,your-name
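Because input_shape is a quoted field containing a comma, a naive split on "," yields 12 fields instead of 11. The sketch below shows a quote-aware splitter and a small row check. Both functions are hypothetical helpers written for this example; they are not the project's CI validator, whose actual checks are not described here.

```rust
/// Split one CSV line into fields, honouring double-quoted fields so that
/// an input_shape such as "[64, 1024]" stays a single column.
/// Minimal sketch: handles "" escapes but not multi-line fields.
pub fn split_csv_line(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut cur = String::new();
    let mut in_quotes = false;
    let mut chars = line.chars().peekable();
    while let Some(ch) = chars.next() {
        match ch {
            '"' if in_quotes && chars.peek() == Some(&'"') => {
                cur.push('"'); // escaped quote inside a quoted field
                chars.next();
            }
            '"' => in_quotes = !in_quotes,
            ',' if !in_quotes => fields.push(std::mem::take(&mut cur)),
            _ => cur.push(ch),
        }
    }
    fields.push(cur);
    fields
}

/// Column names from the header line above.
pub const COLUMNS: [&str; 11] = [
    "device_id", "kernel_id", "dtype", "input_shape", "batch_size", "impl_lang",
    "latency_us", "driver_version", "toolchain", "git_sha", "submitter",
];

/// Hypothetical sanity check: right column count and a positive latency.
pub fn check_row(line: &str) -> Result<(), String> {
    let f = split_csv_line(line);
    if f.len() != COLUMNS.len() {
        return Err(format!("expected {} columns, got {}", COLUMNS.len(), f.len()));
    }
    match f[6].parse::<f64>() {
        Ok(us) if us > 0.0 => Ok(()),
        _ => Err(format!("latency_us must be a positive number, got {:?}", f[6])),
    }
}
```

Running check_row over each data line before opening a pull request is a cheap way to catch an unquoted shape or a missing column locally.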

Steps

Fork the pu-rs.org repo
Add your CSV to submissions/
Open a pull request
CI validates format and sanity checks
Maintainers review and merge

Requirements

Minimum 20 runs per (kernel, shape) pair
Report median latency
Include driver version and toolchain
Device must exist in db/seed_devices.sql (or add it in the same PR)
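The run-count and median requirements can be sketched as one helper. The function name is hypothetical, and averaging the two middle samples for an even run count is an assumption, since this page does not say how ties are handled.

```rust
/// Median of latency samples in microseconds.
/// Returns None for fewer than `min_runs` samples, an empty slice,
/// or any NaN sample.
pub fn median_latency_us(samples: &[f64], min_runs: usize) -> Option<f64> {
    if samples.len() < min_runs || samples.is_empty() || samples.iter().any(|s| s.is_nan()) {
        return None;
    }
    let mut sorted = samples.to_vec();
    // partial_cmp cannot fail here because NaN was rejected above.
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        // Even count: mean of the two middle samples (assumed convention).
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}
```

Calling it with min_runs = 20 enforces the minimum above, and the returned value is what would go in the latency_us column.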

Running the benchmark

All benchmark scripts live in this repo under scripts/.

# Metal (Apple Silicon)
# Requires: ascend_metal_kernels Python module
#   (build: cd tile-rs-ascend/crates/ascend_metal_py && maturin develop --release)
ASCEND_METAL_KERNELS=1 python3 scripts/bench_metal.py --device apple-m2-max-38
ASCEND_METAL_KERNELS=1 python3 scripts/bench_metal.py --device apple-m4-max-40 -o submissions/m4-max.csv

# Ascend NPU (Huawei 910B/910C)
# Requires: CANN SDK + the tile-rs-ascend backend repo cloned locally
bash scripts/bench_ascend.sh --device huawei-910b
bash scripts/bench_ascend.sh --device huawei-910c --only softmax --ascend-rs ~/tile-rs-ascend

Supported backends

Backend	Script	Prerequisites
Apple Metal	scripts/bench_metal.py	ascend_metal_kernels Python module (build command shown in the comments above)
Huawei Ascend	scripts/bench_ascend.sh	CANN SDK + tile-rs-ascend backend repo
NVIDIA CUDA	scripts/bench_cuda.py	mlir_to_cuda codegen (T4 / H20 results published)
AMD ROCm	scripts/bench_rocm.py	Planned
