Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

GEMM (MatMul)

Category: Linear Algebra | Complexity: O(M·K·N) | Compute: Cube/Tensor Core bound

Algorithm

Dense matrix multiplication: C[M×N] = A[M×K] × B[K×N].

The fundamental ML primitive — dominates runtime in transformers (linear projections, attention scores, FFN layers). Performance depends on tiling strategy, memory hierarchy utilization, and hardware matrix units (cube engines, tensor cores).

ascend-rs Kernel Source

Matrix multiplication in ascend-rs uses the tile API, which compiles to hardware-specific matmul units (cube engine on Ascend, tensor cores on CUDA, etc.).

Safe entry form — kernel body is pure safe Rust, shape committed at the type level:

#![allow(unused)]
fn main() {
use ascend_std::tile::{GmView, GmViewMut, safe, tile_load_view_f32, tile_store_view_f32};

#[ascend_std::aiv_kernel]
pub fn tile_matmul(
    a:      GmView<'_, M, K, f32>,
    b:      GmView<'_, K, N, f32>,
    output: GmViewMut<'_, M, N, f32>,
) {
    let at = tile_load_view_f32(&a);
    let bt = tile_load_view_f32(&b);
    let c  = safe::tile_matmul_f32(at, bt);
    tile_store_view_f32(&output, c);
}
}

No unsafe blocks in the body. A mismatched K between the two operands is a compile-time error. The #[aiv_kernel] attribute rewrites the emitted signature to raw *const f32 / *mut f32 so the launcher toolchain sees an unchanged C ABI — #[repr(transparent)] makes the rewrite free at the LLVM IR level.

This compiles via rustc_codegen_mlir → MLIR → target code on all 9 backends:

  • Ascend AIV: PTO-MLIR pto.tmatmul → cube engine (320 TFLOPS f16 on 910B)
  • CUDA: __shared__ tiled GEMM with __syncthreads()
  • Apple Metal / Vulkan SPIR-V: compute shader with shared-memory tiling
  • AWS NKI (Trainium): nki.isa.nc_matmul
  • AMD AIE: AIE2P cascade matmul
  • Cambricon BANG: __bang_matmul on MLU tensor units
  • Intel Gaudi: HPU matmul intrinsic
  • Google TPU: XLA dot_general via OpenXLA

For benchmarking, vendor-optimized libraries are used: aclnnMatmul (Ascend), cuBLAS (CUDA), MPSMatrixMultiplication (Metal).

Benchmark configurations

Shape (A × B)FLOPsNotes
[1024, 1024] × [1024, 1024]2.1 GSmall square, tests dispatch overhead
[4096, 4096] × [4096, 4096]137 GStandard benchmark, bandwidth→compute transition
[8192, 8192] × [8192, 8192]1.1 TLarge square, saturates compute units
[16384, 16384] × [16384, 16384]8.8 TFull hardware saturation
[1024, 4096] × [4096, 1024]8.6 GRectangular, typical FFN down-projection
[4096, 1024] × [1024, 4096]34.4 GRectangular, typical FFN up-projection
[2048, 8192] × [8192, 2048]67.1 GTransformer-scale attention projection

All benchmarks use f16 input with f16 output (or f32 accumulation where supported).

Results

DeviceShapeLatency (μs)TFLOPSGOPS/W
Ascend 910B[4096²]×[4096²]437314.51014
Ascend 910B[8192²]×[8192²]3,614304.2981
Ascend 910B[16384²]×[16384²]27,467320.21033
Ascend 910B[2048, 8192]×[8192, 2048]245280.0903
Ascend 910B[4096, 1024]×[1024, 4096]132260.1839
Apple M2 Max[4096²]×[4096²]17,3747.953
Apple M2 Max[8192²]×[8192²]139,5967.953
Apple M2 Max[2048, 8192]×[8192, 2048]8,9727.751
Apple M2 Max[4096, 1024]×[1024, 4096]4,3457.953
Tesla T4[4096²]×[4096²]5,69824.1345
Tesla T4[8192²]×[8192²]44,09924.9356
Tesla T4[2048, 8192]×[8192, 2048]2,56726.8383
Tesla T4[4096, 1024]×[1024, 4096]1,54922.2317

Peak: 320 TFLOPS (f16) on Ascend 910B — saturating the theoretical maximum. Tesla T4 peaks at 26.8 TFLOPS (f16) via cuBLAS (torch.matmul). Apple M2 Max peaks at 7.9 TFLOPS (f16) via MPSMatrixMultiplication.

See Leaderboard filtered to MatMul for the full filterable view.