Attention (Scaled Dot-Product)

Category: Attention | Complexity: O(B·H·S²·D) | Compute: Cube/Tensor Core bound

Algorithm

Scaled dot-product attention: Output = softmax(Q·K^T / √d) · V

The core transformer primitive — computes attention weights from queries and keys, applies softmax normalization, then produces a weighted sum of values. Dominates runtime in all transformer architectures (GPT, BERT, LLaMA, etc.).

Pipeline:

Scores = Q × K^T — matmul (S×D) × (D×S) → (S×S)
Scale by 1/√d — element-wise multiply
Softmax along last axis — numerically stable (max → sub → exp → sum → div)
Output = Weights × V — matmul (S×S) × (S×D) → (S×D)

ascend-rs Kernel Source

The attention pipeline in ascend-rs combines tile-API matmul with custom Rust kernels for scale and softmax:

#![allow(unused)]
fn main() {
use ascend_rs::prelude::*;

let scale = 1.0f32 / (d_k as f32).sqrt();

// Step 1: scores = Q × K^T  (HGEMM via cube engine)
acl_blas_hgemm(TransN, TransT, TransN,
    seq_len, seq_len, d_k,
    &alpha, &d_q, d_k, &d_k_mat, d_k,
    &beta, &mut d_scores, seq_len,
    HighPrecision, &stream)?;

// Step 2: scores *= 1/√d_k  (custom Rust kernel → NPU)
scale_kernel.launch(1, &stream, &mut [
    d_scores.as_mut_ptr(),  // in-place
    d_scores.as_mut_ptr(),  // output (same buffer)
    d_n_scores.as_mut_ptr(),
    d_scale.as_mut_ptr(),
])?;

// Step 3: weights = softmax(scores)  (custom Rust kernel → NPU)
softmax_kernel.launch(1, &stream, &mut [
    d_scores.as_mut_ptr(),
    d_weights.as_mut_ptr(),
    d_row_len.as_mut_ptr(),
    d_num_rows.as_mut_ptr(),
])?;

// Step 4: output = weights × V  (HGEMM via cube engine)
acl_blas_hgemm(TransN, TransN, TransN,
    seq_len, d_k, seq_len,
    &alpha, &d_weights, seq_len, &d_v, d_k,
    &beta, &mut d_output, d_k,
    HighPrecision, &stream)?;
}

The scale and softmax kernels are written in Rust and compiled via rustc_codegen_mlir → MLIR → backend code. The GEMMs use vendor-optimized libraries (aclnnMatmul, cuBLAS, MPSMatrixMultiplication).

Backend status for the fused safe::tile_attention_f32 tile op: Ascend AIV, Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V (5/9). CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO — on those backends the pipeline still runs as separate matmul + softmax + matmul dispatches using the individually-lowered tile ops.

Benchmark configurations

Shape (B, H, S, D)	FLOPs	Notes
(1, 1, 128, 64)	4.2 M	Small baseline, dispatch overhead test
(1, 1, 512, 64)	67 M	Medium sequence
(1, 1, 1024, 64)	268 M	GPT-2 scale
(1, 1, 2048, 64)	1.1 G	Long context
(1, 1, 4096, 64)	4.3 G	Very long context (quadratic scaling)
(1, 8, 512, 64)	537 M	8-head, GPT-2 like
(1, 12, 512, 64)	805 M	12-head, BERT-base
(1, 32, 512, 64)	2.1 G	32-head, LLaMA-7B
(1, 32, 1024, 128)	17.2 G	32-head, LLaMA-2-7B
(1, 32, 2048, 128)	68.7 G	32-head, long context

All benchmarks use f16 input with f16 output. FLOPs ≈ 4·B·H·S²·D (two matmuls dominate).

Results

Device	Shape (B,H,S,D)	Latency (μs)	TFLOPS	Notes
Ascend 910B	(1,32,1024,128)	310	55.4	aclnnMatmul+Softmax, manual pipeline
Ascend 910B	(1,32,2048,128)	1,459	47.1	Memory-bound at long context
Ascend 910B	(1,1,4096,64)	149	28.9	Single-head, large S
Ascend 910B	(1,32,512,64)	105	20.4	32-head, short context
Tesla T4	(1,32,1024,128)	2,609	6.6	F.scaled_dot_product_attention
Tesla T4	(1,32,2048,128)	5,067	13.6	Flash attention backend
Tesla T4	(1,1,4096,64)	974	4.4	Single-head
Tesla T4	(1,32,512,64)	427	5.0	32-head
Apple M2 Max	(1,32,1024,128)	138,819	0.12	MPS GEMM + CPU softmax
Apple M2 Max	(1,1,4096,64)	60,647	0.07	MPS GEMM + CPU softmax

Peak: 55.4 TFLOPS (f16) on Ascend 910B. Tesla T4 peaks at 13.6 TFLOPS (f16) via PyTorch SDPA. Apple M2 Max peaks at 0.14 TFLOPS — bottlenecked by CPU softmax (no fused MPS attention).

See Leaderboard filtered to Attention for the full filterable view.

Keyboard shortcuts

pu-rs.org — xPU Kernel Benchmark

Attention (Scaled Dot-Product)

Algorithm

ascend-rs Kernel Source

Benchmark configurations

Results