GELU

Category: Activation | Complexity: O(N) elementwise | Memory: 1 pass (fused read+write)

Algorithm

GELU (Gaussian Error Linear Unit, Hendrycks & Gimpel 2016) is the standard activation in BERT, GPT, LLaMA, and most transformer models:

GELU(x) = x · Φ(x) = x · 0.5 · (1 + erf(x / √2))

The fast tanh approximation (used in PyTorch gelu(approximate='tanh')):

GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))

GELU is memory-bandwidth bound — the compute-to-byte ratio is low (a few FLOPs per 4-byte element), so peak throughput is measured in GB/s rather than TFLOPS.

ascend-rs Kernel Source

GELU using ascend-rs buffer API (f32, tanh approximation):

#![allow(unused)]
fn main() {
/// GELU activation: y = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
///
/// params: [n: u32]
#[ascend_std::aiv_kernel]
pub fn gelu(
    input: *const f32,
    output: *mut f32,
    params: *const u32,
) {
    let n = *params as usize;
    let sqrt_2_pi: f32 = 0.7978845608; // sqrt(2/pi)
    let coeff: f32 = 0.044715;

    let mut i: usize = 0;
    while i < n {
        let x = *input.wrapping_add(i);
        let x3 = x * x * x;
        let inner = sqrt_2_pi * (x + coeff * x3);
        // tanh via exp: tanh(z) = (e^2z - 1)/(e^2z + 1)
        let e2z = (2.0 * inner).exp();
        let tanh_val = (e2z - 1.0) / (e2z + 1.0);
        *output.wrapping_add(i) = 0.5 * x * (1.0 + tanh_val);
        i += 1;
    }
}
}

Vectorized version using buffer intrinsics:

#![allow(unused)]
fn main() {
#[ascend_std::aiv_kernel]
pub fn gelu_vec(
    input: *const f32,
    output: *mut f32,
    params: *const u32,
) {
    let n = *params;
    let in_buf = ascend_std::ascend_buf_alloc(n);
    let work = ascend_std::ascend_buf_alloc(n);
    let work2 = ascend_std::ascend_buf_alloc(n);

    ascend_std::ascend_buf_load_f32(in_buf, input, n);
    ascend_std::ascend_pipe_barrier();

    // x³
    ascend_std::ascend_mul_f32(work, in_buf, in_buf, n);
    ascend_std::ascend_pipe_barrier();
    ascend_std::ascend_mul_f32(work, work, in_buf, n);
    ascend_std::ascend_pipe_barrier();
    // 0.044715 * x³
    ascend_std::ascend_muls_f32(work, work, 0.044715, n);
    ascend_std::ascend_pipe_barrier();
    // x + 0.044715 * x³
    ascend_std::ascend_add_f32(work, in_buf, work, n);
    ascend_std::ascend_pipe_barrier();
    // sqrt(2/pi) * (x + 0.044715 * x³)
    ascend_std::ascend_muls_f32(work, work, 0.7978845608, n);
    ascend_std::ascend_pipe_barrier();

    // Store result
    ascend_std::ascend_buf_store_f32(output, work, n);
}
}

These buffer-API kernels run on the Ascend AIV backend via rustc_codegen_mlir. No tile-API safe::tile_gelu_f32 currently exists — tile-API lowerings on all 9 backends (Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. Cross-backend execution today goes through the buffer-API scalar loop or the element-wise intrinsic composition shown above.

Benchmark configurations

Shape	Elements	Bytes (f32)	Notes
(1, 768)	768	3 KB	GPT-2 hidden dim
(1, 4096)	4K	16 KB	LLaMA hidden dim
(64, 768)	49K	192 KB	Typical batch
(64, 4096)	262K	1 MB	Bandwidth-bound
(1024, 4096)	4.2M	16 MB	Large batch

All benchmarks use f32.

Results

See Leaderboard filtered to GELU for the full filterable view.

Keyboard shortcuts

pu-rs.org — xPU Kernel Benchmark

GELU

Algorithm

ascend-rs Kernel Source

Benchmark configurations

Results