DeepSeek Decode Throughput (Cross-Vendor)

Per-kernel latency (the rest of this site) is the microbenchmark view. This page reports the complementary end-to-end decode throughput for a full DeepSeek-R1-Distill-Qwen-1.5B decode step, measured on five production accelerators from the same 13-kernel Rust source emitted through ascend-rs’s per-vendor MLIR backends.

Headline

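| Rank | Device | Backend | Emitted language | Decode tok/s |
|------|--------|---------|------------------|--------------|
| 1 | Huawei Ascend 910B2 | mlir_to_cpp + mlir_to_pto | AscendC C++ + PTO-MLIR | 168.9 |
| 2 | Google TPU v2-8 (Colab) | mlir_to_tpu | Pallas | 162.9 |
| 3 | Apple M2 Max | mlir_to_msl | Metal | 91.7 |
| 4 | NVIDIA T4 (Colab) | mlir_to_gpu | CUDA | 53.7 |
| 5 | AWS Trainium1 (trn1.2xlarge) | mlir_to_nki | NKI Python | 12.2 |
| - | CPU reference | - | plain Rust | 3.7 |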

All five numbers come from identical Rust kernel sources. The MLIR → vendor backend is the only thing that changes between rows. 168.9 tok/s on 910B2 is 2.47× the aclnn-only baseline and 45.6× the CPU reference.
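The ratios quoted above can be checked directly from the table. The sketch below is only an arithmetic check, not part of the benchmark harness: the function names `speedup` and `implied_baseline` are illustrative, and the aclnn-only baseline is not listed in the table, so it is back-computed from the stated 2.47× figure (roughly 68.4 tok/s) rather than measured.

```rust
/// Decode throughput in tokens per second, copied from the headline table.
const HEADLINE: [(&str, f64); 6] = [
    ("Huawei Ascend 910B2", 168.9),
    ("Google TPU v2-8", 162.9),
    ("Apple M2 Max", 91.7),
    ("NVIDIA T4", 53.7),
    ("AWS Trainium1", 12.2),
    ("CPU reference", 3.7),
];

/// Speedup of `tok_s` over a reference throughput.
fn speedup(tok_s: f64, reference_tok_s: f64) -> f64 {
    tok_s / reference_tok_s
}

/// Baseline throughput implied by a measured throughput and a stated speedup.
fn implied_baseline(tok_s: f64, stated_speedup: f64) -> f64 {
    tok_s / stated_speedup
}

fn main() {
    let cpu = HEADLINE[5].1;
    for (device, tok_s) in HEADLINE {
        println!("{device:<22} {tok_s:>6.1} tok/s  {:>5.1}x CPU", speedup(tok_s, cpu));
    }
    // The aclnn-only baseline is not in the table; this back-computes it
    // from the stated 2.47x ratio, so treat it as derived, not measured.
    println!(
        "implied aclnn-only baseline: {:.1} tok/s",
        implied_baseline(HEADLINE[0].1, 2.47)
    );
}
```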

Why this complements the kernel leaderboard

The kernel leaderboard tells you which chip runs a given softmax or GEMM fastest in isolation. Decode tok/s tells you what a real inference workload achieves once those kernels are composed with host-side launch overhead, KV cache traffic, and memory-bandwidth pressure. A chip can win per-kernel and still lose on decode. The Trainium row is the clearest example: strong per-op latency, but only 9.5% bandwidth utilisation end-to-end.
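One common way to arrive at a bandwidth-utilisation figure like the one above is to treat decode as a weight-streaming workload: every parameter is read once per token, so achieved bandwidth is roughly bytes-per-token times tok/s. The sketch below illustrates that model only. The page does not state how its 9.5% was computed, so the function names, the 1.5e9 nominal parameter count, the f16 weight width, and the peak-bandwidth placeholder are all assumptions, and the printed value is not expected to reproduce 9.5%.

```rust
/// Bytes streamed from memory per generated token, assuming every weight is
/// read exactly once per decode step (KV-cache traffic is ignored here).
fn weight_bytes_per_token(param_count: f64, bytes_per_param: f64) -> f64 {
    param_count * bytes_per_param
}

/// Fraction of peak memory bandwidth achieved at a given decode rate.
fn bandwidth_utilisation(tok_per_s: f64, bytes_per_token: f64, peak_bytes_per_s: f64) -> f64 {
    tok_per_s * bytes_per_token / peak_bytes_per_s
}

fn main() {
    // ASSUMPTION: nominal 1.5e9 parameters stored as f16 (2 bytes each).
    let bytes = weight_bytes_per_token(1.5e9, 2.0);

    // HYPOTHETICAL round placeholder (1 TB/s); substitute the datasheet
    // peak bandwidth of the device being analysed.
    let peak_bytes_per_s = 1.0e12;

    // 12.2 tok/s is the Trainium1 row of the headline table.
    let util = bandwidth_utilisation(12.2, bytes, peak_bytes_per_s);
    println!("utilisation under these assumptions: {:.2}%", util * 100.0);
}
```

Because utilisation scales linearly with tok/s under this model, a device far below its bandwidth ceiling is usually limited by something else, such as launch overhead or pipelining gaps between kernels.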

Notes per device

  • Ascend 910B2: the +pto half of the joint path contributes the four decode matmul shapes (1.75×–2.98× vs aclnn). RMSNorm stays on CPU; every other op is on-NPU. See the main ascend-rs blog ch10 for the per-kernel breakdown.
  • TPU v2-8: Colab-visible Pallas on a 4-chip v2 pod; the emitted kernel set is rms_norm, matvec_f16, and an attention fusion.
  • Apple M2 Max: emitted Metal beats Apple’s hand-tuned MLX on decode at this model size. Measured via deepseek_metal.
  • NVIDIA T4: Colab Tesla T4; same three kernels as the TPU row, emitted in CUDA. 53.7 tok/s is below roofline (the T4 is memory-bandwidth-starved for this model shape; its memory is GDDR6, not HBM).
  • AWS Trainium1 (trn1.2xlarge): six emitted NKI kernels (rms_norm, three matvec_f16 variants, gate_up_silu). Traced via torch_neuronx.trace in two halves. An eager single-NEFF run reaches only 2.5 tok/s, about 5× slower, because the single compile unit can’t pipeline across the whole decode path. Trace time: 461 s; wall time: 5.23 s for 64 decode steps.
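The Trainium1 figures above are internally consistent: 64 decode steps in 5.23 s is about 12.2 tok/s, and that is a little under 5× the 2.5 tok/s eager rate. The sketch below shows that arithmetic together with a minimal timing harness of the kind that could produce such a number. It is not the harness used for this page; `measure_tok_per_s` and the sleeping `decode_step` closure are hypothetical stand-ins for one full decode iteration.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Tokens per second for `steps` decode steps that took `wall_seconds` in total.
fn tok_per_s(steps: u32, wall_seconds: f64) -> f64 {
    f64::from(steps) / wall_seconds
}

/// Times `steps` calls to `decode_step` and returns tokens per second.
/// `decode_step` is a hypothetical stand-in for one full decode iteration.
fn measure_tok_per_s<F: FnMut()>(steps: u32, mut decode_step: F) -> f64 {
    let start = Instant::now();
    for _ in 0..steps {
        decode_step();
    }
    tok_per_s(steps, start.elapsed().as_secs_f64())
}

fn main() {
    // Figures from the Trainium1 note: 64 decode steps in 5.23 s of wall time.
    let traced = tok_per_s(64, 5.23);
    println!("two-half trace: {traced:.1} tok/s");
    println!("ratio to eager 2.5 tok/s: {:.1}x", traced / 2.5);

    // Dummy workload (a 1 ms sleep per step) so the harness runs anywhere;
    // a real measurement would call the device decode step here instead.
    let rate = measure_tok_per_s(16, || sleep(Duration::from_millis(1)));
    println!("dummy workload: {rate:.0} tok/s");
}
```

Note that the 461 s trace time is a one-off compile cost and is excluded from the throughput figure; only steady-state wall time enters the division.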

Reproduction

Each number is reproducible with the commands documented in the per-device sections of ascend-rs ch10. The source kernels live at crates/ascend_std/src/tile.rs in the public repo; the per-vendor emitters live in crates/rustc_codegen_mlir/src/mlir_to_*.rs.


The per-kernel leaderboard remains the authoritative view for isolated-kernel efficiency. Decode throughput is reported here as the complementary end-to-end metric.