Dilated Conv1D + ReLU
Category: Convolution | Complexity: O(B·T·C²·3) | Fusion: pad + gather + matmul + ReLU
Algorithm
A dilated 1D convolution with kernel size 3, fused with bias and ReLU activation. Used in VQ-VAE encoder/decoder ResConv1DBlocks (e.g., SOKE, Jukebox-style models).
The naive implementation requires:
- Pad the input by
dilationon each side (zero-padding) - Gather 3 positions per output:
[t-d, t, t+d] - Concat along the channel axis ->
(B, T, 3C) - Linear projection
(3C -> C) - ReLU
This kernel fuses all 5 steps into a single GPU pass, eliminating three intermediate (B, T, 3C) buffer allocations and the data shuffles between them.
Why fusion matters
For an 18-block VQ-VAE encoder/decoder, the unfused version allocates 54 intermediate tensors per forward pass and reads them back. Fusing into one kernel:
- Eliminates intermediate buffer writes/reads (3x memory bandwidth reduction)
- Keeps activations in registers/L1 cache between stages
- One command buffer dispatch instead of five
ascend-rs Kernel Source
Vectorized dilated conv1d + ReLU using ascend-rs buffer API (f32, benchmarked implementation):
#![allow(unused)]
fn main() {
#[ascend_std::aiv_kernel]
pub fn conv1d_dilated(input: *const f32, output: *mut f32, params: *const u32) {
let n = *params;
let dilation = *params.wrapping_add(1);
let w0 = f32::from_bits(*params.wrapping_add(2));
let w1 = f32::from_bits(*params.wrapping_add(3));
let w2 = f32::from_bits(*params.wrapping_add(4));
let bias = f32::from_bits(*params.wrapping_add(5));
let aligned_n = ((n + 7) / 8) * 8;
let in_buf = ascend_std::ascend_buf_alloc(aligned_n);
let tap_left = ascend_std::ascend_buf_alloc(aligned_n);
let tap_right = ascend_std::ascend_buf_alloc(aligned_n);
let acc = ascend_std::ascend_buf_alloc(aligned_n);
let work = ascend_std::ascend_buf_alloc(aligned_n);
ascend_std::ascend_buf_load_f32(in_buf, input, n);
ascend_std::ascend_pipe_barrier();
// Build shifted tap buffers with zero-padding
ascend_std::ascend_buf_fill_f32(tap_left, 0.0, aligned_n);
let mut i = dilation;
while i < n {
let v = ascend_std::ascend_get_value_f32(in_buf, i - dilation);
ascend_std::ascend_set_value_f32(tap_left, i, v);
i += 1;
}
ascend_std::ascend_buf_fill_f32(tap_right, 0.0, aligned_n);
i = 0;
while i + dilation < n {
let v = ascend_std::ascend_get_value_f32(in_buf, i + dilation);
ascend_std::ascend_set_value_f32(tap_right, i, v);
i += 1;
}
// Vector MAC: acc = tap_left*w0 + input*w1 + tap_right*w2 + bias
ascend_std::ascend_muls_f32(acc, tap_left, w0, n);
ascend_std::ascend_muls_f32(work, in_buf, w1, n);
ascend_std::ascend_add_f32(tap_left, acc, work, n);
ascend_std::ascend_muls_f32(work, tap_right, w2, n);
ascend_std::ascend_add_f32(acc, tap_left, work, n);
ascend_std::ascend_adds_f32(acc, acc, bias, n);
ascend_std::ascend_maxs_f32(acc, acc, 0.0, n); // ReLU
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_buf_store_f32(output, acc, n);
}
}
This buffer-API kernel runs on the Ascend AIV backend via rustc_codegen_mlir. No tile-API safe::tile_conv1d_f32 currently exists — tile-API lowerings on all 9 backends (Ascend AIV / CUDA / Apple Metal / Vulkan SPIR-V / AWS NKI / AMD AIE / Cambricon BANG / Intel Gaudi / Google TPU) are future work. On non-Ascend backends the fused pad+gather+matmul+ReLU is currently expressed as a buffer-API composition rather than a single tile op.
Benchmark configurations
| Shape (B, T, C) | Elements | Notes |
|---|---|---|
| (2, 50, 512) | 51 K | Single VQ-VAE block, small batch |
| (8, 100, 512) | 410 K | Mid-sized clip |
| (2, 400, 512) | 410 K | Long sequence |
Results
See Leaderboard filtered to conv1d-dilated for the full filterable view.