Triton Autotune and Persistent Tuned Configs¶
FFPA's Triton backend can autotune forward and backward launch parameters for large-head-dimension attention. The autotune result can be persisted as a device-specific JSON file and reused later when runtime autotune is disabled.
This is useful for production inference or training jobs where you want stable startup latency and do not want each process to pay the Triton autotune cost.
Overview¶
There are two ways to use Triton tuned configs:
- Run with runtime autotune enabled for one process:
from ffpa_attn import TritonBackend, ffpa_attn_func
out = ffpa_attn_func(
q,
k,
v,
forward_backend=TritonBackend(
forward=True,
autotune=True,
autotune_mode="fast"
),
backward_backend=TritonBackend(
backward=True,
autotune=True,
autotune_mode="fast"
),
)
Triton benchmarks candidate configs and caches the best config in the current process. This is convenient for experiments, but the chosen config is not stored in the FFPA repository.
- Generate persistent tuned configs once, then use the default non-autotune path:
The generated JSON is saved under src/ffpa_attn/triton/configs/{device_name}.json, for example src/ffpa_attn/triton/configs/NVIDIA_L20.json.
Later calls with TritonBackend(..., autotune=False) will automatically load the matching device config when it exists.
Generate Persistent Configs¶
Run the autotune generator from the repository root or from an installed FFPA environment:
By default, the command refuses to overwrite an existing device config. Use
--overwrite when you intentionally want to regenerate it:
The generator defaults to B=1 and H=32. You can change them when your deployment shape uses a different batch size or query-head count:
By default, the generated task grid covers the baseline no-mask, no-dropout,
equal-head cases. Add --full-tasks to also tune canonical attn_mask,
dropout, GQA, and MQA variants modeled after python -m ffpa_attn.bench:
python -m ffpa_attn.autotune \
--mode fast \
--directions both \
--dtypes bf16,fp16 \
--full-tasks \
--overwrite
--full-tasks can increase autotune time substantially because each additional variant is benchmarked separately. It is intentionally disabled by default so existing generation jobs keep their current coverage and runtime.
You can generate only forward configs, only backward configs, or both:
python -m ffpa_attn.autotune --mode fast --directions forward --overwrite
python -m ffpa_attn.autotune --mode fast --directions backward --overwrite
python -m ffpa_attn.autotune --mode fast --directions both --dtypes bf16,fp16 --overwrite
The default for --directions is both, and the default dtype set is bf16 only. For benchmarks such as python -m ffpa_attn.bench that run both fp16 and bf16, generate both dtype configs explicitly:
On SM90+ devices, add --enable-fwd-tma or --enable-bwd-tma to additionally generate persistent configs for the descriptor/TMA forward or backward path when each task shape supports it. The baseline fwd_generic and bwd_generic configs are still generated as compatibility fallbacks. Add --enable-fwd-ws or --enable-bwd-ws with the matching TMA flag when you also want warp-specialized TMA candidates. The legacy --enable-tma and --enable-ws flags remain as aliases that enable both directions:
python -m ffpa_attn.autotune \
--mode fast \
--directions forward \
--dtypes bf16,fp16 \
--enable-fwd-tma \
--enable-fwd-ws \
--overwrite
Output Location¶
The default output directory is the package config directory, src/ffpa_attn/triton/configs/.
The file name is derived from torch.cuda.get_device_name() with non-file-name characters replaced by underscores. For example, the NVIDIA L20 config is written to NVIDIA_L20.json.
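The naming rule can be sketched in a few lines. This is only an illustration: the function name device_config_filename and the exact set of characters treated as file-name-safe are assumptions, not FFPA's actual implementation.

```python
import re


def device_config_filename(device_name: str) -> str:
    """Map a device name to a JSON file name (illustrative rule only)."""
    # Assumption: letters, digits, '.', '_' and '-' are kept as-is; every run
    # of other characters (spaces, slashes, ...) collapses to one underscore.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", device_name.strip())
    return f"{safe}.json"


print(device_config_filename("NVIDIA L20"))  # NVIDIA_L20.json
```

In a real run the input would come from torch.cuda.get_device_name(); the string is passed in explicitly here so the sketch runs without a GPU.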
For smoke tests or CI, write to a temporary directory so partial configs are not mistaken for full device configs:
python -m ffpa_attn.autotune \
--mode fast \
--directions both \
--overwrite \
--output-dir /tmp/ffpa-config-smoke
At runtime, FFPA can also load configs from a custom directory with FFPA_TUNED_CONFIG_DIR:
Multi-GPU Parallel Autotune¶
When multiple GPUs are available, you can distribute autotune tasks across them with --num-gpus. This uses Ray actors to ensure each GPU runs exactly one autotune task at a time, avoiding resource contention:
CUDA_VISIBLE_DEVICES=4,5,6,7 \
python -m ffpa_attn.autotune \
--mode max \
--directions both \
--dtypes bf16,fp16 \
--full-tasks \
--num-gpus 4 \
--overwrite
--num-gpus must not exceed the number of GPUs exposed by CUDA_VISIBLE_DEVICES. The generator validates this at startup and exits with a clear message when fewer GPUs are available than requested.
The parallel path uses a pool of Ray actors (one per GPU) and a queue-based scheduler: each worker starts with one task, and as soon as a worker finishes its task, it immediately receives the next pending task. This keeps all GPUs busy while respecting the one-task-per-GPU constraint.
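The scheduling idea can be illustrated without Ray. The sketch below swaps Ray actors for plain Python threads and a shared queue, so it shows only the "one task per worker, pull the next task when done" pattern, not FFPA's actual scheduler. The names run_on_workers and run_task are hypothetical.

```python
import queue
import threading


def run_on_workers(tasks, num_workers, run_task):
    """Run hashable tasks with at most one in-flight task per worker."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)

    results = {}
    lock = threading.Lock()

    def worker(worker_id):
        # Each worker (standing in for one GPU) pulls its next task as soon
        # as the previous one finishes, and exits when the queue is empty.
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return
            outcome = run_task(worker_id, task)
            with lock:
                results[task] = outcome

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Toy usage: ten "tasks" spread over four workers.
squares = run_on_workers(range(10), 4, lambda worker_id, task: task * task)
print(sorted(squares.items()))
```

Because workers pull from the queue instead of receiving a fixed slice of tasks, a slow task on one worker does not leave the others idle.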
Results from all workers are merged into the same device JSON. The output format, structure, and file path are identical to the single-GPU path. No intermediate files are written by the workers.
--num-gpus defaults to None, which keeps the existing single-GPU serial tuning path. On machines where Ray is not installed, omitting --num-gpus avoids any Ray dependency. The Ray runtime is only imported when --num-gpus is explicitly set.
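Deferring an optional dependency like this is a common pattern. A minimal sketch, assuming a hypothetical helper maybe_import_ray and a hypothetical error message; FFPA's own import logic may differ:

```python
import importlib


def maybe_import_ray(num_gpus):
    """Import Ray only when a parallel run was requested (hypothetical helper)."""
    if num_gpus is None:
        # Serial path: Ray is never touched, so it need not be installed.
        return None
    try:
        return importlib.import_module("ray")
    except ImportError as exc:
        # Hypothetical message, shown only to illustrate failing early.
        raise SystemExit("--num-gpus requires the 'ray' package") from exc


print(maybe_import_ray(None))
```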
Remote Ray Cluster¶
You can also submit autotune tasks to a remote Ray cluster by setting --ray-address:
python -m ffpa_attn.autotune \
--mode max --full-tasks \
--num-gpus 8 \
--ray-address ray://my-cluster-head:6379 \
--overwrite
When --ray-address is used, --num-gpus requests that many GPUs from the cluster. The local CUDA_VISIBLE_DEVICES setting is ignored.
Development Smoke Tests with Ray¶
During development, combine FFPA_AUTOTUNE_MAX_CONFIGS with --num-gpus for fast smoke tests:
CUDA_VISIBLE_DEVICES=4,5,6,7 \
FFPA_AUTOTUNE_MAX_CONFIGS=4 \
python -m ffpa_attn.autotune \
--mode fast \
--num-gpus 4 \
--overwrite \
--output-dir /tmp/ffpa-config-smoke
Autotune Modes¶
The generator supports the same mode names as the runtime Triton autotune path:
| Mode | Purpose |
|---|---|
| fast | Smaller search space. Recommended as the default persistent config mode. |
| max | Larger search space. Slower to generate, but may find better configs. |
The runtime lookup requires the mode to match. A JSON generated with --mode fast is used when triton_autotune_mode="fast"; a JSON generated with --mode max is used when triton_autotune_mode="max".
Shape Coverage¶
The generator tunes the following head dimensions:
The sequence-length grid is:
The 1 entry is used only for decode query length (Nq=1). Decode tuning does not generate Nkv=1 cases because a single-token KV cache is not a meaningful decode-attention benchmark target.
The 16384 sequence length is generated only when the current GPU has at least 48 GiB of memory. Smaller-memory devices skip it.
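As a rough illustration of this memory gate, the sketch below filters a sequence-length list by total device memory. The helper name filter_seqlens and the example grid are assumptions for demonstration; only the 16384 entry and the 48 GiB threshold come from the text above.

```python
GIB = 1024 ** 3


def filter_seqlens(seqlens, total_memory_bytes, large_seqlen=16384, min_gib=48):
    """Drop the largest sequence length on devices below the memory threshold."""
    if total_memory_bytes >= min_gib * GIB:
        return list(seqlens)
    return [n for n in seqlens if n != large_seqlen]


# Illustrative grid only; the real grid is defined by the generator.
example_grid = [512, 1024, 2048, 4096, 8192, 16384]
print(filter_seqlens(example_grid, 24 * GIB))
print(filter_seqlens(example_grid, 80 * GIB))
```

On a real device the byte count could be read from torch.cuda.get_device_properties(0).total_memory.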
Persistent config generation tunes every target sequence length in this grid with exact Triton autotune keys. It does not reuse the online runtime seqlen buckets while generating JSON, so an entry for 512, 1024, or 2048 means that shape was benchmarked independently. Runtime lookup still reuses a nearby entry when the workload shape does not exactly match a persisted entry.
The generated matrix covers:
| Direction | Kernels |
|---|---|
| Forward | generic forward, split-KV/decode stage1 |
| Backward | delta preprocess, main backward, decode backward stage1 |
Forward tasks include self-attention, cross-attention, decode attention (Nq=1), causal, and non-causal variants.
Backward tasks include main backward shapes (Nq >= 512) and decode backward shapes (Nq=1, Nkv>1), with causal and non-causal variants.
When --full-tasks is enabled, the generator adds square prefill variants for each tuned sequence length:
| Case | Shape / variant |
|---|---|
| attn-mask | Compact additive key-position mask [1, 1, 1, Nkv]. Backward tunes the bias-gradient path. |
| dropout | dropout_p=0.1, using the Triton dropout path. |
| gqa | Hq=H, Hkv chosen with the same divisor rule as python -m ffpa_attn.bench. |
| mqa | Hq=H, Hkv=1. |
These are single-feature canonical variants, not a full Cartesian product. For example, --full-tasks tunes gqa and dropout separately, but it does not generate a combined GQA+dropout case.
Runtime Lookup Rules¶
When runtime autotune is disabled, FFPA tries to load the current device JSON. If no compatible entry is found, it falls back to the built-in default launch parameters.
Forward lookup filters by:
| Field | Meaning |
|---|---|
| direction | Must be forward. |
| kernel | fwd_generic or decode_fwd_stage1. |
| autotune_mode | Must match triton_autotune_mode. |
| dtype | fp16 or bf16. |
| causal | Must match is_causal. |
| has_attn_bias | Whether attn_mask / additive bias is present. |
| has_dropout | Whether dropout is active. |
Backward lookup additionally filters by:
| Field | Meaning |
|---|---|
| kernel | bwd_preproc, bwd_generic, or decode_bwd_stage1. |
| preprocess_d_chunk | Applies to the delta preprocess kernel. |
| bias_grad | Whether attention-bias gradients are requested. |
| grad_v_storage_dtype | Optional internal dV storage override. |
| use_gemv | Decode backward single-query specialization. |
| has_dropout | Whether dropout replay is active. |
| has_attn_bias | Whether additive bias is active. |
Generated entries may include nheads_q and nheads_kv for logging and JSON metadata. Runtime lookup can prefer an exact recorded head layout when one is available, but it does not require the head layout to match. Batch size and head count commonly vary across workloads, so FFPA reuses the same launch config across compatible mask/dropout/causal/kernel variants instead of missing the persistent config because Hq or Hkv changed.
Configs generated before these variant fields existed are treated as no-mask, no-dropout entries. They can still satisfy baseline requests, while has_attn_bias and has_dropout continue to prevent semantically different kernel variants from being mixed.
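The forward filter and the legacy-entry rule can be sketched as a single predicate. The field names come from the forward table above; representing each entry as a flat Python dict and the helper name matches_forward are assumptions made for illustration.

```python
def matches_forward(entry, *, kernel, mode, dtype, causal,
                    has_attn_bias=False, has_dropout=False):
    """Return True if a persisted entry is compatible with a forward request."""
    return (
        entry.get("direction") == "forward"
        and entry.get("kernel") == kernel
        and entry.get("autotune_mode") == mode
        and entry.get("dtype") == dtype
        and entry.get("causal") == causal
        # Entries written before the variant fields existed lack these keys
        # and are treated as no-mask, no-dropout.
        and entry.get("has_attn_bias", False) == has_attn_bias
        and entry.get("has_dropout", False) == has_dropout
    )


legacy = {"direction": "forward", "kernel": "fwd_generic",
          "autotune_mode": "fast", "dtype": "bf16", "causal": True}
request = dict(kernel="fwd_generic", mode="fast", dtype="bf16", causal=True)

print(matches_forward(legacy, **request))                    # baseline request
print(matches_forward(legacy, **request, has_dropout=True))  # dropout request
```

The legacy entry satisfies the baseline request but is rejected for the dropout request, which keeps semantically different kernel variants apart.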
After variant filtering, FFPA chooses the nearest persisted head dimension. Ties prefer the larger candidate. Examples:
| Runtime D | Persisted D |
|---|---|
| 384 | 320 |
| 448 | 512 |
| 900 | 1024 |
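The nearest-match rule with its tie-break fits in one expression. This is a sketch, not FFPA's code: the helper name nearest_head_dim and the persisted set [320, 512, 768, 1024] are hypothetical, chosen so the results agree with the table above.

```python
def nearest_head_dim(runtime_d, persisted_dims):
    """Pick the closest persisted head dimension; ties go to the larger one."""
    # Sort key: smallest distance first, then largest dimension first.
    return min(persisted_dims, key=lambda d: (abs(d - runtime_d), -d))


persisted = [320, 512, 768, 1024]  # hypothetical persisted set
for d in (384, 448, 900):
    print(d, "->", nearest_head_dim(d, persisted))

# 416 is equally far from 320 and 512, so the larger candidate wins.
print(416, "->", nearest_head_dim(416, [320, 512]))
```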
For sequence length, FFPA chooses the smallest persisted sequence length that is greater than or equal to the runtime value. If the runtime value is larger than all persisted values, FFPA uses the largest available persisted value. Examples:
| Runtime seqlen | Persisted seqlen |
|---|---|
| 3000 | 4096 |
| 32768 | 8192 or 16384, depending on what was generated |
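The sequence-length rule is a ceiling lookup with a clamp at the top. A minimal sketch, assuming a hypothetical helper pick_seqlen and an illustrative persisted list:

```python
import bisect


def pick_seqlen(runtime_n, persisted_seqlens):
    """Smallest persisted seqlen >= runtime_n, else the largest persisted one."""
    ordered = sorted(persisted_seqlens)
    i = bisect.bisect_left(ordered, runtime_n)
    return ordered[i] if i < len(ordered) else ordered[-1]


persisted = [512, 1024, 2048, 4096, 8192]  # illustrative; no 16384 generated
print(pick_seqlen(3000, persisted))    # rounds up to 4096
print(pick_seqlen(32768, persisted))   # clamps to the largest entry, 8192
```

Rounding up rather than to the nearest value means a runtime shape is matched with a config tuned for at least as much work.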
Debug Persistent Lookup¶
Set FFPA_LOGGER_LEVEL=DEBUG when you want to verify that runtime calls are using persistent tuned configs instead of falling back to the built-in defaults:
FFPA_LOGGER_LEVEL=DEBUG \
FFPA_TUNED_CONFIG_DIR=/tmp/ffpa-config-smoke \
python -m ffpa_attn.bench --fwd-backend triton --bwd-backend triton
On repeated runtime lookup hits, FFPA logs the kernel name and sanitized launch config selected from the in-process persistent config cache. The message uses debug_once semantics, so the same cache-hit/config line is emitted once per process instead of repeating on every attention call.
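The once-per-process behaviour can be pictured with a small helper. The name debug_once appears in the text above, but the implementation below is a hypothetical sketch built on the standard logging module, not FFPA's logger.

```python
import logging

logger = logging.getLogger("ffpa_debug_once_sketch")  # hypothetical logger name
_seen_messages = set()


def debug_once(message: str) -> bool:
    """Log a debug message the first time it is seen in this process."""
    if message in _seen_messages:
        return False  # already emitted; stay quiet on repeated calls
    _seen_messages.add(message)
    logger.debug(message)
    return True


logging.basicConfig(level=logging.DEBUG)
for _ in range(3):
    debug_once("persistent config hit: kernel=fwd_generic")
```

Running this emits the message a single time even though the call happens three times, which keeps per-call attention logging from flooding the output.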
Development Smoke Tests¶
Full autotune generation can take a long time. Use FFPA_AUTOTUNE_MAX_CONFIGS to cap the number of shapes during development:
CUDA_VISIBLE_DEVICES=0 \
FFPA_AUTOTUNE_MAX_CONFIGS=4 \
python -m ffpa_attn.autotune \
--mode fast \
--directions both \
--overwrite \
--output-dir /tmp/ffpa-config-smoke
Then run a small workload with that temporary directory:
To compare persistent tuned configs against the built-in fallback launch defaults without removing the JSON, force runtime lookup to bypass persisted entries:
CUDA_VISIBLE_DEVICES=0 \
FFPA_TUNED_CONFIG_DIR=/tmp/ffpa-config-smoke \
FFPA_SKIP_PERSISIT_TUNED_CONFIG=1 \
python your_script.py
For repository tests, a focused check is:
Production Workflow¶
A typical workflow is:
- Select the deployment GPU type.
- Generate a full config on that GPU:
CUDA_VISIBLE_DEVICES=0 python -m ffpa_attn.autotune \
--mode fast --directions both --dtypes bf16,fp16 --full-tasks --overwrite
- Commit the generated JSON under src/ffpa_attn/triton/configs/.
- Run normal workloads with runtime autotune disabled, which is the default:
If the runtime shape is outside the generated grid, FFPA uses the nearest compatible persisted config. If the device JSON is missing, malformed, or does not contain a compatible entry, FFPA silently keeps the existing built-in launch defaults.
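The silent-fallback behaviour can be sketched as a loader that never raises. The helper name load_device_config, the assumption that the file holds a JSON object, and the placeholder defaults are all hypothetical; the sketch only shows the pattern of treating missing or malformed files as "no config".

```python
import json
from pathlib import Path


def load_device_config(path):
    """Return the parsed device JSON, or None if it is missing or malformed."""
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        # OSError covers a missing or unreadable file; ValueError covers
        # json.JSONDecodeError and text decoding errors.
        return None
    # Assumption: a valid device config is a JSON object at the top level.
    return data if isinstance(data, dict) else None


BUILTIN_DEFAULTS = {"source": "builtin"}  # placeholder for fixed launch defaults
config = load_device_config("/nonexistent/NVIDIA_L20.json") or BUILTIN_DEFAULTS
print(config)
```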
Current Scope and Limitations¶
The first persistent-config generator focuses on the common no-bias and no-dropout path. The JSON schema already records bias_grad, has_dropout, and grad_v_storage_dtype, so bias-gradient and dropout-specific configs can be added later without changing the runtime lookup design.
decode_dq_reduce and key-bias gradient reduction are not currently autotuned by FFPA, so they keep their fixed launch parameters.
Persistent configs are device-specific. Do not reuse a JSON generated on one GPU class as a performance baseline for a different GPU class.