Skip to content

Configurable Environment Variables

This document summarizes all configurable environment variables in ffpa-attn, which control the kernel build (target SM list, headdim coverage, ccache shim) and the runtime kernel selection knobs (MMA accumulator dtype, SMEM swizzle, prefetch / persist policies, launch grid layout). Most runtime knobs are boolean integers parsed from 0 / 1.

Build-time environment variables

These are read once during pip install . / python setup.py build_ext and decide which translation units are generated and how nvcc is invoked.

  • FFPA_BUILD_ARCH, default "" (current device SM), Comma/semicolon/space separated list of target CUDA SM architectures. Accepts numeric SMs (e.g. "80,89,90") or aliases (maxwell, pascal, volta, turing, ampere, ada, hopper, blackwell, blackwell_geforce). When empty, falls back to the currently visible CUDA device's compute capability.
  • ENABLE_FFPA_CUDA_IMPL, default False (0), Build the optional native CUDA forward extension (ffpa_attn._C). When unset, the package builds in pure Python / Triton-only mode and large-D forward defaults to the Triton backend. Native CUDA backward is no longer generated; use Triton or SDPA backward.
  • FFPA_SKIP_CUDA_EXT, default False (0), Force-skip the CUDA extension build even if ENABLE_FFPA_CUDA_IMPL=1 is set. Used by ReadTheDocs and the check-mkdocs GitHub workflow where no nvcc is available; mkdocstrings can still import ffpa_attn to extract docstrings.
  • FFPA_NVCC_THREADS, default 4, nvcc intra-TU parallelism (--threads N). With the per-headdim TU split, the outer MAX_JOBS already drives many nvcc processes in parallel, so keeping --threads small avoids oversubscription. Set to 1 to disable intra-TU threading entirely; larger values only help when MAX_JOBS is small.
  • FFPA_PTXAS_VERBOSE, default False (0), Emit ptxas verbose info (register / SMEM usage). Off by default because it produces tens of MB of log output; enable only for tuning.
  • FFPA_DEV_HEADDIMS, default "", Development-time headdim subset override. Comma/space separated list of headdims (e.g. "256,512") that replaces the full generated set for fast iteration. Empty (default) means use the full set decided by ENABLE_FFPA_ALL_HEADDIM.
  • ENABLE_FFPA_ALL_STAGES, default True (1), When 1, generate kernels for all multi-stage variants (stages 1~4); when 0, only stages 1~2 are generated. Reducing this shortens build time at the cost of fewer schedule choices at runtime.
  • ENABLE_FFPA_ALL_HEADDIM, default False (0), When 1, headdims range from 32 to 1024 with step 32 (range(32, 1024, 32)); when 0, headdims range from 256 to 1024 with step 64 (range(256, 1024, 64)).
  • MAX_JOBS, default min(nproc, 32) via tools/build_fast.sh, Outer build parallelism passed to setuptools. The fast-build wrapper auto-caps at 32; for plain python setup.py build_ext, set it explicitly (e.g. MAX_JOBS=32).

tools/build_fast.sh-only variables

These only affect the ccache-based fast-build wrapper.

  • FFPA_CLEAN, default 0, When 1, removes build/, dist/, ffpa_attn.egg-info/, compiled *.so, and generated csrc/cuffpa/generated/*.{cu,h} before rebuilding.
  • FFPA_BUILD_IN_SHM, default 0, When 1, symlinks build/ into /dev/shm/ffpa-build-$USER (tmpfs) for IO-bound machines.
  • CCACHE_MAXSIZE, default 20G, Cap of the ccache storage used by the nvcc shim.
  • CCACHE_DIR, default ~/.ccache, ccache storage directory.
  • NVCC_REAL, default $CUDA_HOME/bin/nvcc, Override the path to the real nvcc (advanced; rarely needed).

Runtime kernel-selection environment variables

These are read by env.py and gate which generated kernel template is dispatched at import / call time. All are booleans (0 / 1) unless noted.

  • FFPA_SKIP_PERSISIT_TUNED_CONFIG, default False (0), When set to 1, lookup_persistent_config() always returns None even if a device JSON exists and matches. This is useful for A/B benchmarking the persistent tuned-config path against the built-in fallback launch defaults without changing call sites.

MMA accumulator dtype

  • ENABLE_FFPA_FORCE_QK_F16, default False (0), Force Q@K^T MMA accumulator to FP16 within the FFPA Acc-F32 kernels. Enables the mixed mode Q@K^T MMA Acc F16 + P@V MMA Acc F32.
  • ENABLE_FFPA_FORCE_PV_F16, default False (0), Force P@V MMA accumulator to FP16 within the FFPA Acc-F32 kernels. Enables the mixed mode Q@K^T MMA Acc F32 + P@V MMA Acc F16.

Prefetch & SMEM-share policies

  • ENABLE_FFPA_PREFETCH_QKV, default True (1), Prefetch QKV at the appropriate time point. Typical boost is 5%~10%.
  • ENABLE_FFPA_QKV_SMEM_SHARE, default False (0), Use a shared QKV SMEM policy. Off by default because separate buffers overlap better with MMA / g2s; turn on for low-SRAM devices.

SMEM swizzle (vs. padding) for Q / K / V

  • ENABLE_FFPA_SMEM_SWIZZLE_Q, default True (1), True: bank-conflict-free Q SMEM via swizzle. False: bank-conflict-free via padding.
  • ENABLE_FFPA_SMEM_SWIZZLE_K, default True (1), Same as above for K SMEM.
  • ENABLE_FFPA_SMEM_SWIZZLE_V, default True (1), Same as above for V SMEM.

Persistent g2s / s2r loads

  • ENABLE_FFPA_PERSIST_Q_G2S, default True (1), Persistently keep Q in SMEM via g2s for headdim <= 320. Trades more SRAM for fewer global loads while keeping register usage stable.
  • ENABLE_FFPA_PERSIST_KV_G2S, default True (1), Persistently keep KV in SMEM via g2s for headdim <= 256. When enabled, FFPA auto-uses the FlashAttention attention-level tiling for headdim <= 256 and the FFPA fine-grained MMA-level tiling for headdim > 256.
  • ENABLE_FFPA_PERSIST_Q_S2R, default False (0), Persistently load Q s2r for headdim < 512 to reduce Q g2s/s2r IO while preserving O(1) SRAM complexity. Adds register pressure as headdim grows; weigh register usage vs. IO reduction before enabling.
  • ENABLE_FFPA_PERSIST_V_S2R, default True (1), Persistently load V s2r for the small-d kernel only (more registers).

Pipelining and launch grid layout

  • ENABLE_FFPA_REGISTERS_PIPE_KV, default False (0), Use register-level ping-pong double buffers for ldmatrix / MMA computation overlap.
  • ENABLE_FFPA_LAUNCH_GRID_DNHB, default False (0), When 1, launch with grid(N/Br, H, B); when 0, launch with grid(N/Br, B * H).

Key notes

  1. Boolean parsing: all ENABLE_FFPA_* and FFPA_PTXAS_VERBOSE are parsed from integer values: 1 = True (enabled), 0 = False (disabled).
  2. Build-time vs. runtime separation: ENABLE_FFPA_ALL_STAGES and ENABLE_FFPA_ALL_HEADDIM decide which TUs are generated at build time — changing them requires rebuilding (FFPA_CLEAN=1 bash tools/build_fast.sh). All other ENABLE_FFPA_* knobs are read at runtime and can be toggled without rebuilding.
  3. Default build mode: source builds are Triton-only unless ENABLE_FFPA_CUDA_IMPL=1 is set. FFPA_SKIP_CUDA_EXT=1 is now only an override that force-disables CUDA extension builds even when the CUDA build flags are enabled.
  4. Default arch inference: when FFPA_BUILD_ARCH is unset and ENABLE_FFPA_CUDA_IMPL=1, setup.py queries the currently visible CUDA device's compute capability. On CI / RTD runners with no GPU, either keep the default Triton-only build or explicitly set FFPA_SKIP_CUDA_EXT=1.
  5. MMA Acc combinations: ENABLE_FFPA_FORCE_QK_F16 and ENABLE_FFPA_FORCE_PV_F16 are mutually exclusive options for the FFPA Acc-F32 kernels — enable at most one to choose the mixed-precision split.
  6. Recommended for tuning: enable FFPA_PTXAS_VERBOSE=1 (build-time) and start from the defaults; toggle one runtime knob at a time and measure with bench/bench_ffpa_fwd.py to attribute the speedup or regression.