
pu-rs.org – Processing Unit Ranking System

The SPECfp for AI accelerators.

FLOPS don’t tell the full story. A chip rated at 1000 TFLOPS means nothing if your softmax kernel only achieves 5% utilization. pu-rs.org measures what matters: actual kernel execution time on real hardware, for the operations that AI workloads actually run.

Why this exists

| What we measure | What others report |
|---|---|
| Softmax latency at (64, 4096) f16 | Peak TFLOPS |
| LayerNorm throughput per watt | Memory bandwidth (theoretical) |
| MatMul efficiency vs roofline | Marketing benchmarks |
| Cost per real GOPS | Cloud $/hour (opaque) |

Scope

We benchmark the kernel primitives that compose every AI model:

| Category | Kernels |
|---|---|
| Activation | Softmax, GELU, SiLU |
| Normalization | LayerNorm, RMSNorm |
| Linear Algebra | GEMM, batched MatMul |
| Attention | Scaled Dot-Product Attention |
| Quantization | VQ-Quantize, INT8 dequant |
| Convolution | Conv1D, dilated Conv1D |
| Reduction | Scatter-add, L1-smooth loss |

Devices covered

| Type | Vendors |
|---|---|
| GPU | NVIDIA (A100, H100, H200, B200), AMD (MI300X), Apple (M2/M4 Max), Cambricon |
| TPU | Google (v5e, v6e Trillium) |
| NPU | Huawei Ascend (910B, 910C), AWS Trainium2, Intel Gaudi 3 |

How it works

  1. Run standardized benchmark scripts on your hardware
  2. Submit CSV results via pull request
  3. CI validates format and sanity checks
  4. Leaderboard updates automatically with per-kernel rankings

All results are tagged with git SHA, driver version, toolchain, and number of runs; median latency is reported. See Methodology for the full protocol.


Built with ascend-rs kernel infrastructure. Data updated weekly.

Leaderboard

Leaderboard columns: #, Vendor, Device, Type, Kernel, Dtype, Shape, Latency (us), GOPS, GOPS/$, GOPS/W, Verified.

Cost Effectiveness

The most important metric for deployment decisions: how much real performance do you get per dollar and per watt?

Table columns: #, Device, Kernel, Latency (us), MSRP ($), TDP (W), GOPS/$, GOPS/W.

Methodology

Measurement protocol

  1. Warmup: 50 iterations discarded
  2. Measurement: 500 iterations, median latency reported
  3. Amortization: Dispatch overhead amortized by batching 500 kernel launches into one command buffer where supported
  4. Isolation: Benchmarks run on idle systems, no background GPU workloads
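The protocol above can be sketched as a minimal timing harness. This is an illustrative Python skeleton, not the project's actual benchmark script; `run_kernel` and `sync` are hypothetical stand-ins for a backend's dispatch and device-synchronization calls:

```python
import time
import statistics

def bench(run_kernel, sync, warmup=50, iters=500):
    """Median kernel latency in microseconds, following the protocol above."""
    for _ in range(warmup):   # warmup iterations, results discarded
        run_kernel()
    sync()
    samples = []
    for _ in range(iters):    # measurement iterations
        t0 = time.perf_counter()
        run_kernel()
        sync()                # wait for the device before stopping the clock
        samples.append((time.perf_counter() - t0) * 1e6)
    return statistics.median(samples)
```

Real harnesses additionally amortize dispatch overhead by batching many launches into one command buffer, as noted in step 3.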

What we measure

Kernel-only time: the GPU/NPU execution time for a single kernel dispatch, excluding:

  • Host-to-device data transfer (data assumed resident)
  • Command buffer creation overhead (amortized)
  • Python/framework overhead

This isolates the hardware+compiler efficiency from the software stack.

Reporting

| Metric | Definition |
|---|---|
| Latency (us) | Median kernel execution time in microseconds |
| GOPS | Throughput: operations / latency |
| GOPS/$ | Throughput / device MSRP in USD |
| GOPS/W | Throughput / TDP in watts |
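The derived metrics follow directly from those definitions. A small sketch (illustrative only; the per-kernel op counts used on the site are not reproduced here, and the numbers in the comment are placeholder inputs):

```python
def metrics(ops, latency_us, msrp_usd, tdp_w):
    """Derive GOPS, GOPS/$, and GOPS/W from an op count and median latency."""
    seconds = latency_us * 1e-6
    gops = ops / seconds / 1e9      # 10^9 operations per second
    return gops, gops / msrp_usd, gops / tdp_w

# Illustrative numbers only: some op count, 12.3 us median latency,
# a $30,000 device with a 700 W TDP.
gops, gops_per_usd, gops_per_w = metrics(
    ops=786_432, latency_us=12.3, msrp_usd=30_000, tdp_w=700
)
```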

Standardized configurations

Each kernel is benchmarked at these canonical shapes:

| Kernel | Shapes | Dtypes |
|---|---|---|
| Softmax | (1,1024), (64,1024), (64,4096) | f32, f16 |
| LayerNorm | (1,768), (64,768), (1024,768) | f32, f16 |
| GEMM | (1024,1024,1024), (4096,4096,4096) | f32, f16, bf16 |
| Attention | (1,32,128,128), (32,32,2048,128) | f32, f16 |

How to submit

See Submit Results.

Softmax

Category: Activation | Complexity: O(N) per row | Memory: 2 passes over input

Algorithm

The online 2-pass softmax (Milakov & Gimelshein 2018):

Pass 1 (single traversal): Maintain running (max, sum) pair per thread. When a new maximum is found, rescale the accumulated sum:

sum_new = sum_old * exp(max_old - max_new) + exp(x - max_new)

Pass 2: Write exp(x - global_max) / global_sum per element.

This is 33% less memory traffic than the naive 3-pass algorithm (max, exp+sum, normalize).
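The two passes can be written out as a plain Python reference, useful for checking a kernel's output; this is a sketch of the math, not the tuned GPU implementation:

```python
import math

def online_softmax(row):
    """Two-pass softmax using the online (max, sum) recurrence from pass 1."""
    # Pass 1: single traversal maintaining a running (max, sum) pair.
    m, s = float("-inf"), 0.0
    for x in row:
        m_new = max(m, x)
        # When the running max changes, rescale the accumulated sum.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Pass 2: write exp(x - global_max) / global_sum per element.
    return [math.exp(x - m) / s for x in row]
```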

Benchmark configurations

| Shape | Elements | Bytes (f32) | Notes |
|---|---|---|---|
| (1, 1024) | 1K | 4 KB | L1-resident, tests dispatch overhead |
| (64, 1024) | 64K | 256 KB | L2-resident, typical batch |
| (64, 4096) | 256K | 1 MB | Bandwidth-bound regime |

Results

See Leaderboard filtered to Softmax for full results.

LayerNorm

Category: Normalization | Complexity: O(N) per row | Memory: 3 passes

Algorithm

3-pass fused: mean, variance, normalize+affine in one workgroup:

  1. Mean: Parallel sum reduction, divide by N
  2. Variance: Parallel sum of (x - mean)^2, compute inverse std
  3. Affine: gamma * (x - mean) * inv_std + beta

Uses SIMD group shuffles for warp-level reductions (1 threadgroup barrier instead of 8).
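The three passes above, written as a sequential Python reference. This illustrates only the math; the actual kernel parallelizes the reductions across a workgroup with SIMD shuffles:

```python
import math

def layernorm(row, gamma, beta, eps=1e-5):
    """Three-pass LayerNorm: mean, variance, normalize + affine."""
    n = len(row)
    mean = sum(row) / n                               # pass 1: mean
    var = sum((x - mean) ** 2 for x in row) / n       # pass 2: variance
    inv_std = 1.0 / math.sqrt(var + eps)              # inverse std
    # pass 3: gamma * (x - mean) * inv_std + beta
    return [g * (x - mean) * inv_std + b
            for x, g, b in zip(row, gamma, beta)]
```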

Why ascend-rs beats MPS by 3x

  1. Single fused kernel dispatch vs MPS’s separate dispatches
  2. No Python/ATen overhead (Rust metal crate -> Metal API directly)
  3. Fused command buffer (500 dispatches per commit)
  4. No intermediate buffer allocations

Benchmark configurations

| Shape | Notes |
|---|---|
| (1, 768) | GPT-2 hidden dim, single position |
| (64, 768) | Typical batch |
| (1024, 768) | Large batch |

See Leaderboard filtered to LayerNorm for full results.

MatMul

Benchmark data coming soon. Submit results to be the first!

See Leaderboard for available results.

Attention

Benchmark data coming soon. Submit results to be the first!

See Leaderboard for available results.

VQ-Quantize

Benchmark data coming soon. Submit results to be the first!

See Leaderboard for available results.

Conv1D

Benchmark data coming soon. Submit results to be the first!

See Leaderboard for available results.

RMSNorm

Benchmark data coming soon. Submit results to be the first!

See Leaderboard for available results.

GELU

Benchmark data coming soon. Submit results to be the first!

See Leaderboard for available results.

Financial Sidecar

Real-time context for xPU investment and procurement decisions.

Stock prices (AI chip vendors)

| Ticker | Company | Role |
|---|---|---|
| NVDA | NVIDIA | GPU market leader |
| AMD | AMD | MI300X, CDNA competitor |
| AAPL | Apple | M-series, Metal ecosystem |
| INTC | Intel | Gaudi, Habana |
| GOOG | Google | TPU, custom silicon |
| AMZN | Amazon | Trainium, Inferentia |

Device street prices

Tracking real-world prices (not MSRP) helps compute true cost-effectiveness:

| Device | MSRP | Street Price | Source |
|---|---|---|---|
| NVIDIA H100 SXM | $30,000 | Check latest | eBay, broker |
| NVIDIA A100 80GB | $10,000 | Check latest | eBay, broker |
| AMD MI300X | $15,000 | Check latest | AMD direct |
| Apple M4 Max (laptop) | $3,999 | Check latest | Apple Store |

Commodity reference

| Symbol | Relevance |
|---|---|
| Gold (XAU) | Store-of-value benchmark |
| Oil (WTI) | Energy cost proxy |
| BTC | Crypto mining demand affects GPU pricing |
| USD/CNY | Huawei/Cambricon pricing |

Price data updated weekly via scripts/fetch_prices.py.

Submit Results

CSV format

Create a CSV file named <device-slug>.csv with these columns:

```csv
device_id,kernel_id,dtype,input_shape,batch_size,impl_lang,latency_us,driver_version,toolchain,git_sha,submitter
nvidia-h100-sxm,softmax,f32,"[64, 1024]",1,cuda,12.3,CUDA 12.4,nvcc 12.4,abc1234,your-name
```
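A submission can be sanity-checked locally before opening a PR. This is an illustrative sketch, not the project's actual CI script; the column list is copied from the format above:

```python
import csv
import io

COLUMNS = ["device_id", "kernel_id", "dtype", "input_shape", "batch_size",
           "impl_lang", "latency_us", "driver_version", "toolchain",
           "git_sha", "submitter"]

def validate(csv_text):
    """Basic checks: header matches the spec, latency parses and is positive."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        assert list(row) == COLUMNS, "unexpected columns"
        assert float(row["latency_us"]) > 0, "latency must be positive"
    return rows
```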

Steps

  1. Fork the pu-rs.org repo
  2. Add your CSV to submissions/
  3. Open a pull request
  4. CI validates format and sanity checks
  5. Maintainers review and merge

Requirements

  • Minimum 20 runs per (kernel, shape) pair
  • Report median latency
  • Include driver version and toolchain
  • Device must exist in db/seed_devices.sql (or add it in the same PR)

Running the benchmark

```sh
# Metal (Apple Silicon)
ASCEND_METAL_KERNELS=1 python3 scripts/bench_metal.py --device apple-m2-max-38

# CUDA (NVIDIA)
python3 scripts/bench_cuda.py --device nvidia-h100-sxm

# Ascend (Huawei NPU)
bash benchmarks/kernel_bench/bench.sh
```