FFPA Attention Benchmark

Quick Start

python3 -m ffpa_attn.bench # default: forward + backward w/o autotuning
python3 -m ffpa_attn.bench --no-bwd # only forward pass
python3 -m ffpa_attn.bench --no-fwd # only backward pass
python3 -m ffpa_attn.bench --fwd-backend triton --bwd-backend triton --tune fast
python3 -m ffpa_attn.bench --fwd-backend triton --bwd-backend triton --tune max
python3 -m ffpa_attn.bench --fwd-backend triton --bwd-backend triton --tune max --fwd-tma --bwd-tma # SM>=90
python3 -m ffpa_attn.bench --fwd-backend cutedsl --bwd-backend cutedsl # SM==90 + dense 320<D<=512
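The hardware notes in the comments above (TMA flags marked SM>=90, CuTeDSL backend marked SM==90) can be folded into a small helper that picks a command line for a given GPU. This is only a sketch: `bench_command` is a hypothetical helper, not part of ffpa_attn. It uses only the flags listed above, it reads "SM" as the CUDA compute capability written as major * 10 + minor, and preferring CuTeDSL on SM 90 is an arbitrary choice made for illustration.

```python
import sys


def bench_command(sm: int, tune: str = "max") -> list[str]:
    """Build an argv list for `python3 -m ffpa_attn.bench` (hypothetical helper).

    sm is the compute capability as major * 10 + minor,
    e.g. 89 for an Ada GPU such as the L20, 90 for Hopper.
    """
    cmd = [sys.executable, "-m", "ffpa_attn.bench"]
    if sm == 90:
        # Quick Start marks the CuTeDSL backend as SM==90 (dense, 320 < D <= 512).
        return cmd + ["--fwd-backend", "cutedsl", "--bwd-backend", "cutedsl"]
    cmd += ["--fwd-backend", "triton", "--bwd-backend", "triton", "--tune", tune]
    if sm >= 90:
        # Quick Start marks the TMA flags as SM>=90.
        cmd += ["--fwd-tma", "--bwd-tma"]
    return cmd


print(" ".join(bench_command(89)[1:]))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be converted to `sm`, and the resulting list can be passed to `subprocess.run(..., check=True)`.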

The ffpa-attn bench CLI (python3 -m ffpa_attn.bench) is the migrated benchmark plotting entrypoint. It preserves the old plot style, can benchmark forward and backward cases on demand, and writes both ffpa_{device}_speedup.png and ffpa_{device}_speedup.md. The additive-mask example uses a compact [1, 1, 1, Nkv] key-position bias by default. Use [1, 1, Nq, Nkv] only when a per-query bias is required, since its memory scales as O(Nq * Nkv).
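To make the memory difference between the two mask shapes concrete, the arithmetic can be sketched as follows. It assumes the mask is materialized as a dense tensor with 2 bytes per element (fp16 or bf16); `mask_bytes` is a hypothetical helper, and Nq = Nkv = 8192 is chosen to match the 8K benchmark cases.

```python
from math import prod


def mask_bytes(shape, bytes_per_elem=2):
    """Size in bytes of a dense additive mask (2 bytes per element for fp16/bf16)."""
    return prod(shape) * bytes_per_elem


nq = nkv = 8192
compact = mask_bytes((1, 1, 1, nkv))     # key-position bias, broadcast over queries
per_query = mask_bytes((1, 1, nq, nkv))  # full per-query bias

print(f"compact:   {compact / 2**10:.0f} KiB")    # 16 KiB
print(f"per-query: {per_query / 2**20:.0f} MiB")  # 128 MiB
```

Under these assumptions the per-query mask is Nq times larger than the compact one, which is why the compact shape is the default.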

Benchmark

The TFLOPS columns report theoretical throughput for the dominant attention GEMMs only, derived from the measured latency; forward and backward are computed separately. Env: NVIDIA L20 (Ada, 119.5 TFLOPS) and NVIDIA H200, PyTorch 2.11, CUDA 13.0, Headdim=512 (not supported by FA-2).
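A rough sketch of how such a TFLOPS figure can be derived from a latency is shown below. It assumes the common FlashAttention-style FLOP convention (two GEMMs in forward, backward counted as 2.5x forward, causal counted as half); the benchmark's exact formula is not stated here. `attention_tflops` is a hypothetical helper, and the batch * heads product of 32 is an assumption, not a value given in the text.

```python
def attention_tflops(batch_heads, nq, nkv, head_dim, latency_ms,
                     causal=False, backward=False):
    """Theoretical attention GEMM throughput in TFLOPS (assumed convention)."""
    # Forward: two GEMMs (Q @ K^T and P @ V), each 2 * Nq * Nkv * D FLOPs per head.
    flops = 4 * batch_heads * nq * nkv * head_dim
    if causal:
        flops /= 2    # assumed: about half of the score matrix is masked out
    if backward:
        flops *= 2.5  # assumed: five GEMMs in backward versus two in forward
    return flops / (latency_ms * 1e-3) / 1e12


# Hypothetical workload: batch * heads = 32 is an assumption.
fwd = attention_tflops(32, 8192, 8192, 512, latency_ms=45.40)
bwd = attention_tflops(32, 8192, 8192, 512, latency_ms=191.89, backward=True)
print(f"forward {fwd:.0f} TFLOPS, backward {bwd:.0f} TFLOPS")
```

With these assumed inputs the sketch gives about 97 and 57 TFLOPS for the two latencies used, but that agreement alone does not confirm the convention or the workload size.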

Forward Pass (Triton, NVIDIA L20, 8K, D=512)

Case dtype Nq/Nkv latency (FFPA / SDPA) TFLOPS (FFPA / SDPA) speedup
self-attn fp16 8192/8192 45.40 / 74.76 ms 97T / 59T 1.65x
self-attn bf16 8192/8192 45.08 / 74.63 ms 98T / 59T 1.66x
cross-attn fp16 1024/8192 6.31 / 10.05 ms 87T / 55T 1.59x
cross-attn bf16 1024/8192 6.14 / 10.10 ms 89T / 54T 1.64x
decode-attn fp16 1/8192 0.77 / 0.80 ms 0.69T / 0.67T 1.03x
decode-attn bf16 1/8192 0.77 / 0.80 ms 0.69T / 0.67T 1.04x
gqa fp16 8192/8192 45.35 / 74.68 ms 97T / 59T 1.65x
gqa bf16 8192/8192 44.93 / 74.70 ms 98T / 59T 1.66x
causal fp16 8192/8192 25.11 / 37.31 ms 88T / 59T 1.49x
causal bf16 8192/8192 24.71 / 37.48 ms 89T / 59T 1.52x
attn-mask fp16 8192/8192 51.79 / 80.66 ms 85T / 55T 1.56x
attn-mask bf16 8192/8192 48.47 / 80.78 ms 91T / 54T 1.67x
dropout fp16 8192/8192 53.49 / 82.95 ms 82T / 53T 1.55x
dropout bf16 8192/8192 53.00 / 82.83 ms 83T / 53T 1.56x
non-aligned fp16 8191/8191 11.96 / 19.13 ms 92T / 57T 1.60x
non-aligned bf16 8191/8191 11.90 / 19.13 ms 92T / 57T 1.61x
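The rows above follow a fixed layout, so they can be parsed for further analysis. The sketch below assumes exactly the row format shown in these tables and assumes that speedup is SDPA latency divided by FFPA latency; `Row` and `parse_row` are hypothetical names, not part of ffpa_attn.

```python
import re
from dataclasses import dataclass

ROW = re.compile(
    r"(?P<case>\S+) (?P<dtype>\S+) (?P<nq>\d+)/(?P<nkv>\d+) "
    r"(?P<ffpa_ms>[\d.]+) / (?P<sdpa_ms>[\d.]+) ms "
    r"(?P<ffpa_tflops>[\d.]+)T / (?P<sdpa_tflops>[\d.]+)T (?P<speedup>[\d.]+)x"
)


@dataclass
class Row:
    case: str
    dtype: str
    nq: int
    nkv: int
    ffpa_ms: float
    sdpa_ms: float
    ffpa_tflops: float
    sdpa_tflops: float
    speedup: float


def parse_row(line: str) -> Row:
    """Parse one benchmark row; raise ValueError if the layout does not match."""
    m = ROW.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"not a benchmark row: {line!r}")
    g = m.groupdict()
    return Row(g["case"], g["dtype"], int(g["nq"]), int(g["nkv"]),
               float(g["ffpa_ms"]), float(g["sdpa_ms"]),
               float(g["ffpa_tflops"]), float(g["sdpa_tflops"]),
               float(g["speedup"]))


row = parse_row("self-attn fp16 8192/8192 45.40 / 74.76 ms 97T / 59T 1.65x")
recomputed = row.sdpa_ms / row.ffpa_ms  # assumed definition of speedup
print(f"{row.case} {row.dtype}: listed {row.speedup:.2f}x, recomputed {recomputed:.2f}x")
```

Because the listed latencies are rounded to two decimals, a recomputed speedup can differ from the listed one in the last digit.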

Backward Pass (Triton, NVIDIA L20, 8K, D=512)

Case dtype Nq/Nkv latency (FFPA / SDPA) TFLOPS (FFPA / SDPA) speedup
self-attn fp16 8192/8192 191.89 / 351.96 ms 57T / 31T 1.83x
self-attn bf16 8192/8192 192.10 / 353.15 ms 57T / 31T 1.84x
cross-attn fp16 1024/8192 26.14 / 42.69 ms 53T / 32T 1.63x
cross-attn bf16 1024/8192 25.91 / 42.56 ms 53T / 32T 1.64x
decode-attn fp16 1/8192 2.65 / 5.81 ms 0.51T / 0.23T 2.20x
decode-attn bf16 1/8192 2.65 / 5.85 ms 0.51T / 0.23T 2.21x
gqa fp16 8192/8192 194.90 / 349.08 ms 56T / 31T 1.79x
gqa bf16 8192/8192 194.04 / 349.12 ms 57T / 31T 1.80x
causal fp16 8192/8192 95.85 / 191.50 ms 57T / 29T 2.00x
causal bf16 8192/8192 96.05 / 190.57 ms 57T / 29T 1.98x
attn-mask fp16 8192/8192 196.93 / 379.14 ms 56T / 29T 1.93x
attn-mask bf16 8192/8192 195.79 / 380.01 ms 56T / 29T 1.94x
dropout fp16 8192/8192 203.18 / 353.68 ms 54T / 31T 1.74x
dropout bf16 8192/8192 201.93 / 354.71 ms 54T / 31T 1.76x
non-aligned fp16 8191/8191 52.38 / 100.03 ms 52T / 27T 1.91x
non-aligned bf16 8191/8191 50.46 / 99.96 ms 54T / 27T 1.98x

Forward Pass (CuTeDSL, NVIDIA H200, 8K, D=512)

Case dtype Nq/Nkv latency (FFPA / SDPA) TFLOPS (FFPA / SDPA) speedup
self-attn fp16 8192/8192 11.42 / 33.10 ms 385T / 133T 2.90x
self-attn bf16 8192/8192 11.82 / 32.47 ms 372T / 135T 2.75x
cross-attn fp16 1024/8192 2.25 / 4.07 ms 244T / 135T 1.81x
cross-attn bf16 1024/8192 2.22 / 4.03 ms 247T / 137T 1.81x
gqa fp16 8192/8192 11.46 / 33.36 ms 384T / 132T 2.91x
gqa bf16 8192/8192 10.92 / 32.52 ms 403T / 135T 2.98x
causal fp16 8192/8192 6.54 / 16.99 ms 336T / 129T 2.60x
causal bf16 8192/8192 6.34 / 16.76 ms 347T / 131T 2.64x
non-aligned fp16 8191/8191 3.05 / 8.05 ms 361T / 137T 2.64x
non-aligned bf16 8191/8191 3.06 / 7.93 ms 359T / 139T 2.59x

Backward Pass (CuTeDSL, NVIDIA H200, 8K, D=512)

Case dtype Nq/Nkv latency (FFPA / SDPA) TFLOPS (FFPA / SDPA) speedup
self-attn fp16 8192/8192 48.67 / 239.57 ms 226T / 46T 4.92x
self-attn bf16 8192/8192 48.38 / 239.17 ms 227T / 46T 4.94x
cross-attn fp16 1024/8192 7.13 / 29.59 ms 193T / 46T 4.15x
cross-attn bf16 1024/8192 7.11 / 29.49 ms 193T / 47T 4.15x
gqa fp16 8192/8192 46.38 / 239.77 ms 237T / 46T 5.17x
gqa bf16 8192/8192 46.16 / 239.07 ms 238T / 46T 5.18x
causal fp16 8192/8192 25.84 / 130.43 ms 213T / 42T 5.05x
causal bf16 8192/8192 25.38 / 130.72 ms 217T / 42T 5.15x
non-aligned fp16 8191/8191 11.88 / 71.09 ms 231T / 39T 5.98x
non-aligned bf16 8191/8191 11.81 / 70.83 ms 233T / 39T 6.00x

Forward Pass (CuTeDSL, NVIDIA H200, 16K, D=512)

Case dtype Nq/Nkv latency (FFPA / SDPA) TFLOPS (FFPA / SDPA) speedup
self-attn fp16 16384/16384 43.11 / 144.70 ms 408T / 122T 3.36x
self-attn bf16 16384/16384 42.28 / 143.35 ms 416T / 123T 3.39x
cross-attn fp16 1024/16384 4.06 / 8.72 ms 271T / 126T 2.15x
cross-attn bf16 1024/16384 4.16 / 8.64 ms 264T / 127T 2.08x
gqa fp16 16384/16384 42.17 / 144.28 ms 417T / 122T 3.42x
gqa bf16 16384/16384 41.31 / 142.98 ms 426T / 123T 3.46x
causal fp16 16384/16384 23.89 / 74.92 ms 368T / 117T 3.14x
causal bf16 16384/16384 23.09 / 74.20 ms 381T / 119T 3.21x
non-aligned fp16 16383/16383 11.16 / 34.46 ms 394T / 128T 3.09x
non-aligned bf16 16383/16383 10.90 / 33.67 ms 403T / 131T 3.09x

Backward Pass (CuTeDSL, NVIDIA H200, 16K, D=512)

Case dtype Nq/Nkv latency (FFPA / SDPA) TFLOPS (FFPA / SDPA) speedup
self-attn fp16 16384/16384 188.43 / 937.25 ms 233T / 47T 4.97x
self-attn bf16 16384/16384 187.01 / 933.86 ms 235T / 47T 4.99x
cross-attn fp16 1024/16384 13.92 / 56.94 ms 197T / 48T 4.09x
cross-attn bf16 1024/16384 13.87 / 56.95 ms 198T / 48T 4.10x
gqa fp16 16384/16384 184.66 / 933.70 ms 238T / 47T 5.06x
gqa bf16 16384/16384 183.01 / 934.03 ms 240T / 47T 5.10x
causal fp16 16384/16384 98.89 / 496.05 ms 222T / 44T 5.02x
causal bf16 16384/16384 97.10 / 497.01 ms 226T / 44T 5.12x
non-aligned fp16 16383/16383 46.30 / 252.40 ms 237T / 44T 5.45x
non-aligned bf16 16383/16383 45.80 / 253.24 ms 240T / 43T 5.53x

Forward Pass (Triton w/ autotune (max), NVIDIA H20, 8K, D=512)

Case dtype Nq/Nkv latency (FFPA / SDPA) TFLOPS (FFPA / SDPA) speedup
self-attn fp16 8192/8192 39.26 / 69.13 ms 112T / 64T 1.76x
self-attn bf16 8192/8192 39.13 / 69.13 ms 112T / 64T 1.77x
cross-attn fp16 1024/8192 5.95 / 9.35 ms 92T / 59T 1.57x
cross-attn bf16 1024/8192 5.94 / 9.35 ms 93T / 59T 1.57x
decode-attn fp16 1/8192 0.30 / 1.00 ms 1.8T / 0.54T 3.31x
decode-attn bf16 1/8192 0.29 / 0.99 ms 1.9T / 0.54T 3.46x
gqa fp16 8192/8192 39.78 / 69.13 ms 111T / 64T 1.74x
gqa bf16 8192/8192 39.16 / 69.14 ms 112T / 64T 1.77x
causal fp16 8192/8192 19.62 / 35.56 ms 112T / 62T 1.81x
causal bf16 8192/8192 19.63 / 35.52 ms 112T / 62T 1.81x
attn-mask fp16 8192/8192 43.40 / 70.30 ms 101T / 63T 1.62x
attn-mask bf16 8192/8192 43.92 / 70.30 ms 100T / 63T 1.60x
dropout fp16 8192/8192 46.36 / 73.53 ms 95T / 60T 1.59x
dropout bf16 8192/8192 46.34 / 73.53 ms 95T / 60T 1.59x
non-aligned fp16 8191/8191 10.09 / 17.81 ms 109T / 62T 1.76x
non-aligned bf16 8191/8191 10.08 / 17.83 ms 109T / 62T 1.77x

Backward Pass (Triton w/ autotune (max), NVIDIA H20, 8K, D=512)

Case dtype Nq/Nkv latency (FFPA / SDPA) TFLOPS (FFPA / SDPA) speedup
self-attn fp16 8192/8192 119.25 / 389.81 ms 92T / 28T 3.27x
self-attn bf16 8192/8192 119.27 / 389.09 ms 92T / 28T 3.26x
cross-attn fp16 1024/8192 14.86 / 49.60 ms 93T / 28T 3.34x
cross-attn bf16 1024/8192 14.88 / 49.71 ms 92T / 28T 3.34x
decode-attn fp16 1/8192 0.98 / 5.91 ms 1.4T / 0.23T 6.05x
decode-attn bf16 1/8192 1.01 / 6.01 ms 1.3T / 0.22T 5.93x
gqa fp16 8192/8192 119.24 / 388.70 ms 92T / 28T 3.26x
gqa bf16 8192/8192 119.25 / 388.83 ms 92T / 28T 3.26x
causal fp16 8192/8192 65.64 / 207.05 ms 84T / 27T 3.15x
causal bf16 8192/8192 65.51 / 207.61 ms 84T / 26T 3.17x
attn-mask fp16 8192/8192 141.89 / 397.86 ms 77T / 28T 2.80x
attn-mask bf16 8192/8192 142.40 / 399.49 ms 77T / 28T 2.81x
dropout fp16 8192/8192 130.43 / 395.33 ms 84T / 28T 3.03x
dropout bf16 8192/8192 131.32 / 398.72 ms 84T / 28T 3.04x
non-aligned fp16 8191/8191 31.87 / 108.35 ms 86T / 25T 3.40x
non-aligned bf16 8191/8191 31.96 / 108.25 ms 86T / 25T 3.39x

Forward Pass (Triton w/ autotune (max), NVIDIA H20, 8K, D=320)

Case dtype Nq/Nkv latency (FFPA / SDPA) TFLOPS (FFPA / SDPA) speedup
self-attn fp16 8192/8192 21.52 / 48.22 ms 128T / 57T 2.24x
self-attn bf16 8192/8192 21.42 / 48.22 ms 128T / 57T 2.25x
cross-attn fp16 1024/8192 3.30 / 6.54 ms 104T / 53T 1.98x
cross-attn bf16 1024/8192 3.20 / 6.53 ms 107T / 53T 2.04x
decode-attn fp16 1/8192 0.29 / 0.73 ms 1.2T / 0.46T 2.55x
decode-attn bf16 1/8192 0.30 / 0.73 ms 1.1T / 0.46T 2.48x
gqa fp16 8192/8192 21.26 / 48.23 ms 129T / 57T 2.27x
gqa bf16 8192/8192 21.15 / 48.22 ms 130T / 57T 2.28x
causal fp16 8192/8192 11.08 / 24.79 ms 124T / 55T 2.24x
causal bf16 8192/8192 11.08 / 24.77 ms 124T / 55T 2.23x
attn-mask fp16 8192/8192 33.08 / 49.66 ms 83T / 55T 1.50x
attn-mask bf16 8192/8192 33.18 / 49.65 ms 83T / 55T 1.50x
dropout fp16 8192/8192 34.48 / 52.81 ms 80T / 52T 1.53x
dropout bf16 8192/8192 34.44 / 52.80 ms 80T / 52T 1.53x
non-aligned fp16 8191/8191 5.55 / 12.43 ms 124T / 55T 2.24x
non-aligned bf16 8191/8191 5.55 / 12.42 ms 124T / 55T 2.24x

Backward Pass (Triton w/ autotune (max), NVIDIA H20, 8K, D=320)

Case dtype Nq/Nkv latency (FFPA / SDPA) TFLOPS (FFPA / SDPA) speedup
self-attn fp16 8192/8192 72.95 / 262.78 ms 94T / 26T 3.60x
self-attn bf16 8192/8192 72.90 / 261.45 ms 94T / 26T 3.59x
cross-attn fp16 1024/8192 9.42 / 33.10 ms 91T / 26T 3.51x
cross-attn bf16 1024/8192 9.42 / 33.06 ms 91T / 26T 3.51x
decode-attn fp16 1/8192 0.75 / 3.97 ms 1.1T / 0.21T 5.29x
decode-attn bf16 1/8192 0.74 / 4.04 ms 1.1T / 0.21T 5.45x
gqa fp16 8192/8192 73.29 / 262.47 ms 94T / 26T 3.58x
gqa bf16 8192/8192 73.11 / 260.93 ms 94T / 26T 3.57x
causal fp16 8192/8192 38.66 / 138.88 ms 89T / 25T 3.59x
causal bf16 8192/8192 38.75 / 138.64 ms 89T / 25T 3.58x
attn-mask fp16 8192/8192 81.71 / 269.02 ms 84T / 26T 3.29x
attn-mask bf16 8192/8192 82.85 / 269.36 ms 83T / 26T 3.25x
dropout fp16 8192/8192 80.60 / 268.33 ms 85T / 26T 3.33x
dropout bf16 8192/8192 81.08 / 270.67 ms 85T / 25T 3.34x
non-aligned fp16 8191/8191 19.70 / 72.14 ms 87T / 24T 3.66x
non-aligned bf16 8191/8191 19.72 / 72.10 ms 87T / 24T 3.66x

Performance benchmarks with large headdim for the NVIDIA L20 (Ada), NVIDIA GeForce RTX 5090 (Blackwell), NVIDIA H800 PCIe (Hopper), and NVIDIA H200 SXM (Hopper, CuTeDSL backend, up to 427 TFLOPS!🎉) are shown below: