
pu-rs.org – Processing Unit Ranking System

The SPECfp for AI accelerators.

FLOPS don’t tell the full story. A chip rated at 1000 TFLOPS means nothing if your softmax kernel only achieves 5% utilization. pu-rs.org measures what matters: actual kernel execution time on real hardware, for the operations that AI workloads actually run.
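The gap between a peak rating and delivered throughput can be made concrete with a quick calculation, using the numbers from the paragraph above:

```python
def effective_tflops(peak_tflops: float, utilization: float) -> float:
    """Delivered throughput = peak rating x achieved utilization."""
    return peak_tflops * utilization

# A chip rated at 1000 TFLOPS whose softmax kernel reaches only 5% utilization
# actually delivers 50 TFLOPS on that kernel.
print(effective_tflops(1000.0, 0.05))  # -> 50.0
```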

Why this exists

| What we measure | What others report |
| --- | --- |
| Softmax latency at (64, 4096) f16 | Peak TFLOPS |
| LayerNorm throughput per watt | Memory bandwidth (theoretical) |
| MatMul efficiency vs roofline | Marketing benchmarks |
| Cost per real GOPS | Cloud $/hour (opaque) |

Scope

We benchmark the kernel primitives that compose every AI model:

| Category | Kernels |
| --- | --- |
| Activation | Softmax, GELU, SiLU |
| Normalization | LayerNorm, RMSNorm |
| Linear Algebra | GEMM, batched MatMul |
| Attention | Scaled Dot-Product Attention |
| Quantization | VQ-Quantize, INT8 dequant |
| Convolution | Conv1D, dilated Conv1D |
| Reduction | Scatter-add, L1-smooth loss |
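As an illustration of what a single kernel measurement looks like, here is a minimal sketch that times a NumPy softmax at the (64, 4096) shape from the table above and reports the median over repeated runs. This is not the project's benchmark script: the real harness, its kernels, and its f16 device code are not shown here, and float32 on CPU stands in for the benchmarked precision.

```python
import time
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Subtract the row max for numerical stability before exponentiating.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def median_latency_ms(fn, x, warmup=3, runs=30):
    """Median wall-clock latency of fn(x) in milliseconds."""
    for _ in range(warmup):                 # discard cold-start runs
        fn(x)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(x)
        samples.append((time.perf_counter() - t0) * 1e3)
    return float(np.median(samples))

x = np.random.default_rng(0).standard_normal((64, 4096), dtype=np.float32)
print(f"softmax (64, 4096): {median_latency_ms(softmax, x):.3f} ms")
```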

Devices covered

| Type | Vendors |
| --- | --- |
| GPU | NVIDIA (A100, H100, H200, B200), AMD (MI300X), Apple (M2/M4 Max), Cambricon |
| TPU | Google (v5e, v6e Trillium) |
| NPU | Huawei Ascend (910B, 910C), AWS Trainium2, Intel Gaudi 3 |

How it works

  1. Run standardized benchmark scripts on your hardware
  2. Submit CSV results via pull request
  3. CI validates format and sanity checks
  4. Leaderboard updates automatically with per-kernel rankings
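The format-and-sanity-check step (step 3) can be sketched as a small CSV validator. The column names below are assumptions for illustration; the real schema is defined by the project's CI.

```python
import csv
import io

# Hypothetical column set -- the actual schema is defined by the project's CI.
REQUIRED = {"kernel", "shape", "dtype", "device", "median_latency_us", "runs"}

def validate(csv_text: str) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    for lineno, row in enumerate(reader, start=2):
        try:
            if float(row["median_latency_us"]) <= 0:
                errors.append(f"line {lineno}: non-positive latency")
            if int(row["runs"]) < 1:
                errors.append(f"line {lineno}: runs must be >= 1")
        except ValueError:
            errors.append(f"line {lineno}: non-numeric latency or runs")
    return errors

sample = (
    "kernel,shape,dtype,device,median_latency_us,runs\n"
    "softmax,(64x4096),f16,H100,41.2,30\n"
)
print(validate(sample))  # -> []
```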

All results are tagged with the git SHA, driver version, toolchain, and run count; the reported figure is the median latency. See the full methodology for details.


Built with ascend-rs kernel infrastructure. Data updated weekly.