pu-rs.org – Processing Unit Ranking System
The SPECfp for AI accelerators.
FLOPS don’t tell the full story. A chip rated at 1000 TFLOPS means nothing if your softmax kernel only achieves 5% utilization. pu-rs.org measures what matters: actual kernel execution time on real hardware, for the operations that AI workloads actually run.
Why this exists
| What we measure | What others report |
|---|---|
| Softmax latency at (64, 4096) f16 | Peak TFLOPS |
| LayerNorm throughput per watt | Memory bandwidth (theoretical) |
| MatMul efficiency vs roofline | Marketing benchmarks |
| Cost per real GOPS | Cloud $/hour (opaque) |
Scope
We benchmark the kernel primitives that compose every AI model:
| Category | Kernels |
|---|---|
| Activation | Softmax, GELU, SiLU |
| Normalization | LayerNorm, RMSNorm |
| Linear Algebra | GEMM, batched MatMul |
| Attention | Scaled Dot-Product Attention |
| Quantization | VQ-Quantize, INT8 dequant |
| Convolution | Conv1D, dilated Conv1D |
| Reduction | Scatter-add, L1-smooth loss |
Devices covered
| Type | Vendors |
|---|---|
| GPU | NVIDIA (A100, H100, H200, B200), AMD (MI300X), Apple (M2/M4 Max), Cambricon |
| TPU | Google (v5e, v6e Trillium) |
| NPU | Huawei Ascend (910B, 910C), AWS Trainium2, Intel Gaudi 3 |
How it works
- Run standardized benchmark scripts on your hardware
- Submit CSV results via pull request
- CI validates format and sanity checks
- Leaderboard updates automatically with per-kernel rankings
All results are tagged with the git SHA, driver version, toolchain, and number of runs; the median latency is reported. See Methodology for full details.
Built with ascend-rs kernel infrastructure. Data updated weekly.
Leaderboard
| # | Vendor | Device | Type | Kernel | Dtype | Shape | Latency (us) | GOPS | GOPS/$ | GOPS/W | Verified |
|---|---|---|---|---|---|---|---|---|---|---|---|
Cost Effectiveness
The most important metric for deployment decisions: how much real performance do you get per dollar and per watt?
| # | Device | Kernel | Latency (us) | MSRP ($) | TDP (W) | GOPS/$ | GOPS/W |
|---|---|---|---|---|---|---|---|
Methodology
Measurement protocol
- Warmup: 50 iterations discarded
- Measurement: 500 iterations, median latency reported
- Amortization: Dispatch overhead amortized by batching 500 kernel launches into one command buffer where supported
- Isolation: Benchmarks run on idle systems, no background GPU workloads
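The protocol above can be sketched as a host-side timing loop (the `run_kernel` callable and names are illustrative, not part of the real bench scripts; the actual scripts additionally amortize dispatch overhead by batching launches into one command buffer):

```python
import statistics
import time

def measure(run_kernel, warmup=50, iters=500):
    """Time a kernel callable per the protocol: discard warmup runs,
    then report the median of the measured runs in microseconds."""
    for _ in range(warmup):            # warmup: 50 iterations discarded
        run_kernel()
    samples = []
    for _ in range(iters):             # measurement: 500 iterations
        t0 = time.perf_counter()
        run_kernel()
        t1 = time.perf_counter()
        samples.append((t1 - t0) * 1e6)  # seconds -> microseconds
    return statistics.median(samples)    # median latency reported

# Trivial stand-in "kernel" to show the call shape:
lat_us = measure(lambda: sum(range(1000)))
```

Medians (rather than means) are reported so that one-off scheduler or thermal hiccups do not skew the result.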
What we measure
Kernel-only time: the GPU/NPU execution time for a single kernel dispatch, excluding:
- Host-to-device data transfer (data assumed resident)
- Command buffer creation overhead (amortized)
- Python/framework overhead
This isolates the hardware+compiler efficiency from the software stack.
Reporting
| Metric | Definition |
|---|---|
| Latency (us) | Median kernel execution time in microseconds |
| GOPS | Throughput: operations / latency |
| GOPS/$ | Throughput / device MSRP in USD |
| GOPS/W | Throughput / TDP in watts |
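The derived metrics follow directly from the definitions above. A small worked example (the op count, latency, price, and TDP figures here are illustrative stand-ins, not measured data):

```python
def gops(ops, latency_us):
    """Throughput in giga-operations per second: ops / latency."""
    return ops / (latency_us * 1e-6) / 1e9

# Illustrative: a (64, 4096) softmax at ~4 ops/element, 12.3 us latency.
ops = 64 * 4096 * 4
throughput = gops(ops, 12.3)            # GOPS
gops_per_dollar = throughput / 30_000   # divided by assumed MSRP in USD
gops_per_watt = throughput / 700        # divided by assumed TDP in watts
```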
Standardized configurations
Each kernel is benchmarked at these canonical shapes:
| Kernel | Shapes | Dtypes |
|---|---|---|
| Softmax | (1,1024), (64,1024), (64,4096) | f32, f16 |
| LayerNorm | (1,768), (64,768), (1024,768) | f32, f16 |
| GEMM | (1024,1024,1024), (4096,4096,4096) | f32, f16, bf16 |
| Attention | (1,32,128,128), (32,32,2048,128) | f32, f16 |
How to submit
See Submit Results.
Softmax
Category: Activation | Complexity: O(N) per row | Memory: 2 passes over input
Algorithm
The online 2-pass softmax (Milakov & Gimelshein 2018):
Pass 1 (single traversal): Maintain running (max, sum) pair per thread. When a new maximum is found, rescale the accumulated sum:
sum_new = sum_old * exp(max_old - max_new) + exp(x - max_new)
Pass 2: Write exp(x - global_max) / global_sum per element.
This cuts memory traffic by 33% relative to the naive 3-pass algorithm (max, exp+sum, normalize).
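A NumPy reference of the online algorithm (illustrative only; production kernels run pass 1 per thread with the rescaling fused into a single traversal):

```python
import numpy as np

def online_softmax(x):
    """2-pass softmax per row: pass 1 keeps a running (max, sum) pair,
    rescaling the sum whenever a new maximum appears; pass 2 normalizes."""
    out = np.empty_like(x)
    for r in range(x.shape[0]):
        m, s = -np.inf, 0.0
        for v in x[r]:                        # pass 1: single traversal
            m_new = max(m, v)
            s = s * np.exp(m - m_new) + np.exp(v - m_new)
            m = m_new
        out[r] = np.exp(x[r] - m) / s         # pass 2: normalize
    return out

row = np.array([[1.0, 2.0, 3.0]])
result = online_softmax(row)
```

The result matches the textbook definition `exp(x - max) / sum(exp(x - max))`; the online formulation just avoids a dedicated max pass.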
Benchmark configurations
| Shape | Elements | Bytes (f32) | Notes |
|---|---|---|---|
| (1, 1024) | 1K | 4 KB | L1-resident, tests dispatch overhead |
| (64, 1024) | 64K | 256 KB | L2-resident, typical batch |
| (64, 4096) | 256K | 1 MB | Bandwidth-bound regime |
Results
See Leaderboard filtered to Softmax for full results.
LayerNorm
Category: Normalization | Complexity: O(N) per row | Memory: 3 passes
Algorithm
3-pass fused: mean, variance, normalize+affine in one workgroup:
- Mean: Parallel sum reduction, divide by N
- Variance: Parallel sum of (x - mean)^2, compute inverse std
- Affine: gamma * (x - mean) * inv_std + beta
Uses SIMD group shuffles for warp-level reductions (1 threadgroup barrier instead of 8).
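The three passes reduce to this NumPy reference (shapes and `eps` are illustrative; the real kernel fuses all three stages into one workgroup with SIMD-group reductions):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Fused LayerNorm per row: mean, variance, then affine normalize."""
    mean = x.mean(axis=-1, keepdims=True)                 # pass 1: mean
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)  # pass 2: variance
    inv_std = 1.0 / np.sqrt(var + eps)
    return gamma * (x - mean) * inv_std + beta            # pass 3: affine

x = np.random.randn(64, 768).astype(np.float32)
y = layer_norm(x, np.ones(768, np.float32), np.zeros(768, np.float32))
# With unit gamma and zero beta, each row comes out ~zero-mean, ~unit-variance.
```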
Why ascend-rs beats MPS by 3x
- A single fused kernel vs MPS’s separate dispatches per stage
- No Python/ATen overhead (Rust metal crate -> Metal API directly)
- Fused command buffer (500 dispatches per commit)
- No intermediate buffer allocations
Benchmark configurations
| Shape | Notes |
|---|---|
| (1, 768) | GPT-2 hidden dim, single position |
| (64, 768) | Typical batch |
| (1024, 768) | Large batch |
See Leaderboard filtered to LayerNorm for full results.
MatMul
Benchmark data coming soon. Submit results to be the first!
See Leaderboard for available results.
Attention
Benchmark data coming soon. Submit results to be the first!
See Leaderboard for available results.
VQ-Quantize
Benchmark data coming soon. Submit results to be the first!
See Leaderboard for available results.
Conv1D
Benchmark data coming soon. Submit results to be the first!
See Leaderboard for available results.
RMSNorm
Benchmark data coming soon. Submit results to be the first!
See Leaderboard for available results.
GELU
Benchmark data coming soon. Submit results to be the first!
See Leaderboard for available results.
Financial Sidecar
Real-time context for xPU investment and procurement decisions.
Stock prices (AI chip vendors)
| Ticker | Company | Role |
|---|---|---|
| NVDA | NVIDIA | GPU market leader |
| AMD | AMD | MI300X, CDNA competitor |
| AAPL | Apple | M-series, Metal ecosystem |
| INTC | Intel | Gaudi, Habana |
| GOOG | Alphabet | TPU, custom silicon |
| AMZN | Amazon | Trainium, Inferentia |
Device street prices
Tracking real-world prices (not MSRP) helps compute true cost-effectiveness:
| Device | MSRP | Street Price | Source |
|---|---|---|---|
| NVIDIA H100 SXM | $30,000 | Check latest | eBay, broker |
| NVIDIA A100 80GB | $10,000 | Check latest | eBay, broker |
| AMD MI300X | $15,000 | Check latest | AMD direct |
| Apple M4 Max (laptop) | $3,999 | Check latest | Apple Store |
Commodity reference
| Symbol | Relevance |
|---|---|
| Gold (XAU) | Store-of-value benchmark |
| Oil (WTI) | Energy cost proxy |
| BTC | Crypto mining demand affects GPU pricing |
| USD/CNY | Huawei/Cambricon pricing |
Price data updated weekly via scripts/fetch_prices.py.
Submit Results
CSV format
Create a CSV file named <device-slug>.csv with these columns:
```csv
device_id,kernel_id,dtype,input_shape,batch_size,impl_lang,latency_us,driver_version,toolchain,git_sha,submitter
nvidia-h100-sxm,softmax,f32,"[64, 1024]",1,cuda,12.3,CUDA 12.4,nvcc 12.4,abc1234,your-name
```
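Before opening a PR, you can sanity-check your file locally. This sketch mirrors the kind of checks CI performs (column names come from the format above; the validation logic itself is illustrative, not the actual CI script):

```python
import csv
import io

REQUIRED = ["device_id", "kernel_id", "dtype", "input_shape", "batch_size",
            "impl_lang", "latency_us", "driver_version", "toolchain",
            "git_sha", "submitter"]

def validate(csv_text):
    """Check that all required columns are present and latencies are sane."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    assert rows, "CSV has no data rows"
    assert all(k in rows[0] for k in REQUIRED), "missing required columns"
    for row in rows:
        assert float(row["latency_us"]) > 0, "latency must be positive"
    return True

sample = (",".join(REQUIRED) + "\n"
          'nvidia-h100-sxm,softmax,f32,"[64, 1024]",1,cuda,12.3,'
          "CUDA 12.4,nvcc 12.4,abc1234,your-name\n")
```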
Steps
- Fork the pu-rs.org repo
- Add your CSV to submissions/
- Open a pull request
- CI validates format and sanity checks
- Maintainers review and merge
Requirements
- Minimum 20 runs per (kernel, shape) pair
- Report median latency
- Include driver version and toolchain
- Device must exist in db/seed_devices.sql (or add it in the same PR)
Running the benchmark
```sh
# Metal (Apple Silicon)
ASCEND_METAL_KERNELS=1 python3 scripts/bench_metal.py --device apple-m2-max-38

# CUDA (NVIDIA)
python3 scripts/bench_cuda.py --device nvidia-h100-sxm

# Ascend (Huawei NPU)
bash benchmarks/kernel_bench/bench.sh
```