Methodology

Measurement protocol

Warmup: 50 iterations discarded
Measurement: 500 iterations, median latency reported
Amortization: Dispatch overhead is amortized by batching 500 kernel launches into one command buffer where the backend supports it
Isolation: Benchmarks run on idle systems with no background GPU workloads
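
The warmup and measurement steps above can be sketched as a small loop. This is a minimal illustration, not the harness used for these results: `dispatch_and_time_us` is a hypothetical callable that launches one kernel and returns its device-measured execution time in microseconds, and how that time is obtained (timestamp queries, profiler events, and so on) is left to the backend.

```python
import statistics

WARMUP_ITERS = 50     # iterations run and discarded
MEASURE_ITERS = 500   # iterations whose latencies are kept


def median_latency_us(dispatch_and_time_us,
                      warmup=WARMUP_ITERS,
                      iters=MEASURE_ITERS):
    """Return the median latency in microseconds over `iters` timed runs.

    `dispatch_and_time_us` is a hypothetical callable (an assumption of
    this sketch) that runs one kernel dispatch and returns its
    kernel-only execution time in microseconds.
    """
    # Warmup: run the kernel and throw the timings away.
    for _ in range(warmup):
        dispatch_and_time_us()

    # Measurement: collect one latency sample per iteration.
    samples = [dispatch_and_time_us() for _ in range(iters)]

    # Report the median, which is less sensitive to outliers than the mean.
    return statistics.median(samples)
```

The median is used here because that is what the protocol reports; a single slow iteration does not shift it the way it would shift a mean.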

Kernel-only time: the GPU/NPU execution time for a single kernel dispatch, excluding:

This isolates hardware and compiler efficiency from the rest of the software stack.

Metric	Definition
Latency (us)	Median kernel execution time in microseconds
GOPS	Throughput in giga-operations per second: operations per dispatch / median latency
GOPS/$	GOPS / device MSRP in USD
GOPS/W	GOPS / device TDP in watts
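
As a worked illustration of the definitions in the table, the derived metrics can be computed from a median latency as follows. The function names and the sample numbers in the usage note are hypothetical, chosen only to make the unit conversion explicit: operations per microsecond is mega-operations per second, so dividing by a further 1000 gives GOPS.

```python
def gops(operations, latency_us):
    """Throughput in giga-operations per second.

    operations / latency_us is operations per microsecond (i.e. Mops/s);
    dividing by 1000 converts that to Gops/s.
    """
    return operations / latency_us / 1000.0


def gops_per_dollar(throughput_gops, msrp_usd):
    """Throughput normalized by device MSRP in USD."""
    return throughput_gops / msrp_usd


def gops_per_watt(throughput_gops, tdp_watts):
    """Throughput normalized by device TDP in watts."""
    return throughput_gops / tdp_watts
```

For example, a kernel that performs 2,000,000 operations with a median latency of 100 us runs at 20 GOPS. On a hypothetical device with a 400 USD MSRP and a 50 W TDP, that is 0.05 GOPS/$ and 0.4 GOPS/W.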

Each kernel is benchmarked at these canonical shapes: