Configurable Environment Variables¶
This document summarizes all configurable environment variables in ffpa-attn, which control the kernel build (target SM list, headdim coverage, ccache shim) and the runtime kernel selection knobs (MMA accumulator dtype, SMEM swizzle, prefetch / persist policies, launch grid layout). Most runtime knobs are boolean integers parsed from 0 / 1.
Build-time environment variables¶
These are read once during pip install . / python setup.py build_ext and decide which translation units are generated and how nvcc is invoked.
- FFPA_BUILD_ARCH, default
""(current device SM), Comma/semicolon/space separated list of target CUDA SM architectures. Accepts numeric SMs (e.g."80,89,90") or aliases (maxwell,pascal,volta,turing,ampere,ada,hopper,blackwell,blackwell_geforce). When empty, falls back to the currently visible CUDA device's compute capability. - ENABLE_FFPA_CUDA_IMPL, default
False (0), Build the optional native CUDA forward extension (ffpa_attn._C). When unset, the package builds in pure Python / Triton-only mode and large-D forward defaults to the Triton backend. Native CUDA backward is no longer generated; use Triton or SDPA backward. - FFPA_SKIP_CUDA_EXT, default
False (0), Force-skip the CUDA extension build even ifENABLE_FFPA_CUDA_IMPL=1is set. Used by ReadTheDocs and thecheck-mkdocsGitHub workflow where no nvcc is available; mkdocstrings can still importffpa_attnto extract docstrings. - FFPA_NVCC_THREADS, default
4, nvcc intra-TU parallelism (--threads N). With the per-headdim TU split, the outerMAX_JOBSalready drives many nvcc processes in parallel, so keeping--threadssmall avoids oversubscription. Set to1to disable intra-TU threading entirely; larger values only help whenMAX_JOBSis small. - FFPA_PTXAS_VERBOSE, default
False (0), Emit ptxas verbose info (register / SMEM usage). Off by default because it produces tens of MB of log output; enable only for tuning. - FFPA_DEV_HEADDIMS, default
"", Development-time headdim subset override. Comma/space separated list of headdims (e.g."256,512") that replaces the full generated set for fast iteration. Empty (default) means use the full set decided byENABLE_FFPA_ALL_HEADDIM. - ENABLE_FFPA_ALL_STAGES, default
True (1), When1, generate kernels for all multi-stage variants (stages1~4); when0, only stages1~2are generated. Reducing this shortens build time at the cost of fewer schedule choices at runtime. - ENABLE_FFPA_ALL_HEADDIM, default
False (0), When1, headdims range from32to1024with step32(range(32, 1024, 32)); when0, headdims range from256to1024with step64(range(256, 1024, 64)). - MAX_JOBS, default
min(nproc, 32)viatools/build_fast.sh, Outer build parallelism passed to setuptools. The fast-build wrapper auto-caps at 32; for plainpython setup.py build_ext, set it explicitly (e.g.MAX_JOBS=32).
tools/build_fast.sh-only variables¶
These only affect the ccache-based fast-build wrapper.
- FFPA_CLEAN, default
0, When1, removesbuild/,dist/,ffpa_attn.egg-info/, compiled*.so, and generatedcsrc/cuffpa/generated/*.{cu,h}before rebuilding. - FFPA_BUILD_IN_SHM, default
0, When1, symlinksbuild/into/dev/shm/ffpa-build-$USER(tmpfs) for IO-bound machines. - CCACHE_MAXSIZE, default
20G, Cap of the ccache storage used by the nvcc shim. - CCACHE_DIR, default
~/.ccache, ccache storage directory. - NVCC_REAL, default
$CUDA_HOME/bin/nvcc, Override the path to the real nvcc (advanced; rarely needed).
Runtime kernel-selection environment variables¶
These are read by env.py and gate which generated kernel template is dispatched at import / call time. All are booleans (0 / 1) unless noted.
- FFPA_SKIP_PERSISIT_TUNED_CONFIG, default
False (0), When set to1,lookup_persistent_config()always returnsNoneeven if a device JSON exists and matches. This is useful for A/B benchmarking the persistent tuned-config path against the built-in fallback launch defaults without changing call sites.
MMA accumulator dtype¶
- ENABLE_FFPA_FORCE_QK_F16, default
False (0), ForceQ@K^TMMA accumulator to FP16 within the FFPA Acc-F32 kernels. Enables the mixed modeQ@K^T MMA Acc F16 + P@V MMA Acc F32. - ENABLE_FFPA_FORCE_PV_F16, default
False (0), ForceP@VMMA accumulator to FP16 within the FFPA Acc-F32 kernels. Enables the mixed modeQ@K^T MMA Acc F32 + P@V MMA Acc F16.
Prefetch & SMEM-share policies¶
- ENABLE_FFPA_PREFETCH_QKV, default
True (1), Prefetch QKV at the appropriate time point. Typical boost is5%~10%. - ENABLE_FFPA_QKV_SMEM_SHARE, default
False (0), Use a shared QKV SMEM policy. Off by default because separate buffers overlap better with MMA / g2s; turn on for low-SRAM devices.
SMEM swizzle (vs. padding) for Q / K / V¶
- ENABLE_FFPA_SMEM_SWIZZLE_Q, default
True (1),True: bank-conflict-free Q SMEM via swizzle.False: bank-conflict-free via padding. - ENABLE_FFPA_SMEM_SWIZZLE_K, default
True (1), Same as above for K SMEM. - ENABLE_FFPA_SMEM_SWIZZLE_V, default
True (1), Same as above for V SMEM.
Persistent g2s / s2r loads¶
- ENABLE_FFPA_PERSIST_Q_G2S, default
True (1), Persistently keep Q in SMEM via g2s for headdim<= 320. Trades more SRAM for fewer global loads while keeping register usage stable. - ENABLE_FFPA_PERSIST_KV_G2S, default
True (1), Persistently keep KV in SMEM via g2s for headdim<= 256. When enabled, FFPA auto-uses the FlashAttention attention-level tiling forheaddim <= 256and the FFPA fine-grained MMA-level tiling forheaddim > 256. - ENABLE_FFPA_PERSIST_Q_S2R, default
False (0), Persistently load Q s2r for headdim< 512to reduce Q g2s/s2r IO while preserving O(1) SRAM complexity. Adds register pressure as headdim grows; weigh register usage vs. IO reduction before enabling. - ENABLE_FFPA_PERSIST_V_S2R, default
True (1), Persistently load V s2r for the small-d kernel only (more registers).
Pipelining and launch grid layout¶
- ENABLE_FFPA_REGISTERS_PIPE_KV, default
False (0), Use register-level ping-pong double buffers forldmatrix/ MMA computation overlap. - ENABLE_FFPA_LAUNCH_GRID_DNHB, default
False (0), When1, launch withgrid(N/Br, H, B); when0, launch withgrid(N/Br, B * H).
Key notes¶
- Boolean parsing: all
ENABLE_FFPA_*andFFPA_PTXAS_VERBOSEare parsed from integer values:1=True(enabled),0=False(disabled). - Build-time vs. runtime separation:
ENABLE_FFPA_ALL_STAGESandENABLE_FFPA_ALL_HEADDIMdecide which TUs are generated at build time — changing them requires rebuilding (FFPA_CLEAN=1 bash tools/build_fast.sh). All otherENABLE_FFPA_*knobs are read at runtime and can be toggled without rebuilding. - Default build mode: source builds are Triton-only unless
ENABLE_FFPA_CUDA_IMPL=1is set.FFPA_SKIP_CUDA_EXT=1is now only an override that force-disables CUDA extension builds even when the CUDA build flags are enabled. - Default arch inference: when
FFPA_BUILD_ARCHis unset andENABLE_FFPA_CUDA_IMPL=1,setup.pyqueries the currently visible CUDA device's compute capability. On CI / RTD runners with no GPU, either keep the default Triton-only build or explicitly setFFPA_SKIP_CUDA_EXT=1. - MMA Acc combinations:
ENABLE_FFPA_FORCE_QK_F16andENABLE_FFPA_FORCE_PV_F16are mutually exclusive options for the FFPA Acc-F32 kernels — enable at most one to choose the mixed-precision split. - Recommended for tuning: enable
FFPA_PTXAS_VERBOSE=1(build-time) and start from the defaults; toggle one runtime knob at a time and measure withbench/bench_ffpa_fwd.pyto attribute the speedup or regression.