From f1976851d937c04e0103bef32cf35f6960b2d0cf Mon Sep 17 00:00:00 2001
From: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Mon, 20 May 2024 02:44:48 +0200
Subject: [PATCH] ROCm: make CK FA2 default instead of Triton (#1924)

As per title. Triton autotune overhead is prohibitive, as it needs to be
done for each different prompt length.
---
 Dockerfile_amd                  | 5 +++--
 docs/source/installation_amd.md | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/Dockerfile_amd b/Dockerfile_amd
index 6f8f874b..92dd0ea8 100644
--- a/Dockerfile_amd
+++ b/Dockerfile_amd
@@ -117,8 +117,9 @@ RUN cd pytorch && python tools/amd_build/build_amd.py && python setup.py install
 # Set as recommended: https://github.com/ROCm/triton/wiki/A-script-to-set-program-execution-environment-in-ROCm
 ENV HIP_FORCE_DEV_KERNARG=1
 
-# On MI300, performances for flash with Triton FA is very competitive (actually better than CK)
-ENV ROCM_USE_FLASH_ATTN_V2_TRITON=1
+# On MI250 and MI300, the performance of Flash Attention with Triton is slightly better than with CK.
+# However, Triton requires tuning for each prompt length, which is prohibitive.
+ENV ROCM_USE_FLASH_ATTN_V2_TRITON=0
 
 FROM base AS kernel-builder
 
diff --git a/docs/source/installation_amd.md b/docs/source/installation_amd.md
index 9c6aa409..636d301c 100644
--- a/docs/source/installation_amd.md
+++ b/docs/source/installation_amd.md
@@ -29,7 +29,7 @@ TunableOp is enabled by default, the warmup may take 1-2 minutes. In case you wo
 
 Two implementations of Flash Attention are available for ROCm, the first is [ROCm/flash-attention](https://github.com/ROCm/flash-attention) based on a [Composable Kernel](https://github.com/ROCm/composable_kernel) (CK) implementation, and the second is a [Triton implementation](https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/flash_attn_triton.py).
 
-By default, as its performances have experimentally been better, Triton implementation is used. It can be disabled (using CK implementation instead) by passing `--env ROCM_USE_FLASH_ATTN_V2_TRITON="0"` when launching TGI's docker container.
+By default, the Composable Kernel implementation is used. The Triton implementation has slightly lower latency on MI250 and MI300, but requires a warmup that can be prohibitive, as it needs to be done again for each new prompt length. If needed, the Triton implementation can be enabled with `--env ROCM_USE_FLASH_ATTN_V2_TRITON="1"` when launching TGI's docker container.
 
 ## Unsupported features
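For reference, a launch command passing this flag might look as follows. This is an illustrative sketch, not part of the patch: the model id, volume path, and image tag are placeholders, and the device/ipc flags follow the usual TGI ROCm docker setup.

```shell
# Illustrative only: model, volume, and image tag are placeholders.
model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data

docker run --rm -it \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    --ipc=host --shm-size 64g -v "$volume":/data \
    --env ROCM_USE_FLASH_ATTN_V2_TRITON="1" \
    ghcr.io/huggingface/text-generation-inference:latest-rocm \
    --model-id "$model"
```

Omitting the `--env` flag leaves the container's default from `Dockerfile_amd` in effect, i.e. the CK implementation after this patch.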