Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Attention (Scaled Dot-Product)

Category: Attention | Complexity: O(B·H·S²·D) | Compute: Cube/Tensor Core bound

Algorithm

Scaled dot-product attention: Output = softmax(Q·K^T / √d) · V

The core transformer primitive — computes attention weights from queries and keys, applies softmax normalization, then produces a weighted sum of values. Dominates runtime in all transformer architectures (GPT, BERT, LLaMA, etc.).

Pipeline:

  1. Scores = Q × K^T — matmul (S×D) × (D×S) → (S×S)
  2. Scale by 1/√d — element-wise multiply
  3. Softmax along last axis — numerically stable (max → sub → exp → sum → div)
  4. Output = Weights × V — matmul (S×S) × (S×D) → (S×D)

ascend-rs Kernel Source

The attention pipeline in ascend-rs combines tile-API matmul with custom Rust kernels for scale and softmax:

#![allow(unused)]
fn main() {
use ascend_rs::prelude::*;

let scale = 1.0f32 / (d_k as f32).sqrt();

// Step 1: scores = Q × K^T  (HGEMM via cube engine)
acl_blas_hgemm(TransN, TransT, TransN,
    seq_len, seq_len, d_k,
    &alpha, &d_q, d_k, &d_k_mat, d_k,
    &beta, &mut d_scores, seq_len,
    HighPrecision, &stream)?;

// Step 2: scores *= 1/√d_k  (custom Rust kernel → NPU)
scale_kernel.launch(1, &stream, &mut [
    d_scores.as_mut_ptr(),  // in-place
    d_scores.as_mut_ptr(),  // output (same buffer)
    d_n_scores.as_mut_ptr(),
    d_scale.as_mut_ptr(),
])?;

// Step 3: weights = softmax(scores)  (custom Rust kernel → NPU)
softmax_kernel.launch(1, &stream, &mut [
    d_scores.as_mut_ptr(),
    d_weights.as_mut_ptr(),
    d_row_len.as_mut_ptr(),
    d_num_rows.as_mut_ptr(),
])?;

// Step 4: output = weights × V  (HGEMM via cube engine)
acl_blas_hgemm(TransN, TransN, TransN,
    seq_len, d_k, seq_len,
    &alpha, &d_weights, seq_len, &d_v, d_k,
    &beta, &mut d_output, d_k,
    HighPrecision, &stream)?;
}

The scale and softmax kernels are written in Rust and compiled via rustc_codegen_mlir → MLIR → backend code. The GEMMs use vendor-optimized libraries (aclnnMatmul, cuBLAS, MPSMatrixMultiplication).

Backend status for the fused safe::tile_attention_f32 tile op: Ascend AIV, Cambricon BANG, Intel Gaudi, Apple Metal, Vulkan SPIR-V (5/9). CUDA / AWS NKI / AMD AIE / Google TPU lowerings are TODO — on those backends the pipeline still runs as separate matmul + softmax + matmul dispatches using the individually-lowered tile ops.

Benchmark configurations

Shape (B, H, S, D)FLOPsNotes
(1, 1, 128, 64)4.2 MSmall baseline, dispatch overhead test
(1, 1, 512, 64)67 MMedium sequence
(1, 1, 1024, 64)268 MGPT-2 scale
(1, 1, 2048, 64)1.1 GLong context
(1, 1, 4096, 64)4.3 GVery long context (quadratic scaling)
(1, 8, 512, 64)537 M8-head, GPT-2 like
(1, 12, 512, 64)805 M12-head, BERT-base
(1, 32, 512, 64)2.1 G32-head, LLaMA-7B
(1, 32, 1024, 128)17.2 G32-head, LLaMA-2-7B
(1, 32, 2048, 128)68.7 G32-head, long context

All benchmarks use f16 input with f16 output. FLOPs ≈ 4·B·H·S²·D (two matmuls dominate).

Results

DeviceShape (B,H,S,D)Latency (μs)TFLOPSNotes
Ascend 910B(1,32,1024,128)31055.4aclnnMatmul+Softmax, manual pipeline
Ascend 910B(1,32,2048,128)1,45947.1Memory-bound at long context
Ascend 910B(1,1,4096,64)14928.9Single-head, large S
Ascend 910B(1,32,512,64)10520.432-head, short context
Tesla T4(1,32,1024,128)2,6096.6F.scaled_dot_product_attention
Tesla T4(1,32,2048,128)5,06713.6Flash attention backend
Tesla T4(1,1,4096,64)9744.4Single-head
Tesla T4(1,32,512,64)4275.032-head
Apple M2 Max(1,32,1024,128)138,8190.12MPS GEMM + CPU softmax
Apple M2 Max(1,1,4096,64)60,6470.07MPS GEMM + CPU softmax

Peak: 55.4 TFLOPS (f16) on Ascend 910B. Tesla T4 peaks at 13.6 TFLOPS (f16) via PyTorch SDPA. Apple M2 Max peaks at 0.14 TFLOPS — bottlenecked by CPU softmax (no fused MPS attention).

See Leaderboard filtered to Attention for the full filterable view.