Do not limit hpu graphs by default (#32) (#61)

Co-authored-by: mswiniarsk <156412439+mswiniarsk@users.noreply.github.com>
jkaniecki 2024-02-21 15:38:00 +01:00 committed by GitHub
parent 6b6dec9ea1
commit 80303b469c
2 changed files with 2 additions and 5 deletions

@@ -76,7 +76,7 @@ Environment Variables Added:
| PROF_WARMUPSTEP | integer | 0 | Enable/disable profiling and control the profiler warmup step; 0 disables profiling | add -e in docker run command |
| PROF_STEP | integer | 5 | Control profile step | add -e in docker run command |
| PROF_PATH | string | /root/text-generation-inference | Define profile folder | add -e in docker run command |
-| LIMIT_HPU_GRAPH | True/False | True | Skip HPU graph usage for prefill to save memory, set to `True` for large sequence/decoding lengths (e.g. 300/212) | add -e in docker run command |
+| LIMIT_HPU_GRAPH | True/False | False | Skip HPU graph usage for prefill to save memory, set to `True` for large sequence/decoding lengths (e.g. 300/212) | add -e in docker run command |
| BATCH_BUCKET_SIZE | integer | 8 | Batch size for decode operation will be rounded to the nearest multiple of this number. This limits the number of cached graphs | add -e in docker run command |
| PREFILL_BATCH_BUCKET_SIZE | integer | 4 | Batch size for prefill operation will be rounded to the nearest multiple of this number. This limits the number of cached graphs | add -e in docker run command |
| SKIP_TOKENIZER_IN_TGI | True/False | False | Skip tokenizer for input/output processing | add -e in docker run command |
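
All of these variables are passed with `-e` on `docker run`. A minimal launch sketch follows, assuming a typical Gaudi setup; the image tag, model id, and Habana runtime flags below are illustrative assumptions, not values taken from this commit:

```bash
# Illustrative sketch only: the image tag, model id and Habana runtime flags
# are assumptions about a typical Gaudi deployment, not part of this commit.
docker run -p 8080:80 --runtime=habana --ipc=host \
    -e HABANA_VISIBLE_DEVICES=all \
    -e LIMIT_HPU_GRAPH=True \
    -e BATCH_BUCKET_SIZE=8 \
    -e PREFILL_BATCH_BUCKET_SIZE=4 \
    ghcr.io/huggingface/tgi-gaudi:latest \
    --model-id meta-llama/Llama-2-7b-hf \
    --max-input-length 1024 \
    --max-total-tokens 2048
```

After this commit the default for `LIMIT_HPU_GRAPH` is `False`, so `-e LIMIT_HPU_GRAPH=True` only needs to be passed when you actually want to skip HPU graphs for prefill to save memory (e.g. for large sequence/decoding lengths).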
@@ -90,9 +90,6 @@ Maximum batch size is controlled by two arguments:
- For the decode operation, set `--max-batch-total-tokens` to `bs * max-total-tokens`, where `bs` is your expected maximum decode batch size (a worked sizing sketch follows below).
- Please note that the batch size will always be padded to the nearest multiple of `BATCH_BUCKET_SIZE` and `PREFILL_BATCH_BUCKET_SIZE`.
-Current limitations:
-- `LIMIT_HPU_GRAPH=False` causes accuracy issues and it should be avoided.
-- Memory usage is higher than expected. Please consider using smaller batch sizes.
> The license to use TGI on Habana Gaudi is the one of TGI: https://github.com/huggingface/text-generation-inference/blob/main/LICENSE
>
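
A worked sizing sketch for the decode budget described above. All numbers are assumptions chosen for illustration, and the image tag and model id are placeholders, not values from this commit:

```bash
# Assumed sizing, for illustration only:
#   max-total-tokens (prompt + generated tokens per request) = 2048
#   expected maximum decode batch size (bs)                  = 8
#   BATCH_BUCKET_SIZE                                        = 8
# => max-batch-total-tokens = bs * max-total-tokens = 8 * 2048 = 16384
docker run -p 8080:80 --runtime=habana --ipc=host \
    -e HABANA_VISIBLE_DEVICES=all \
    -e BATCH_BUCKET_SIZE=8 \
    -e PREFILL_BATCH_BUCKET_SIZE=4 \
    ghcr.io/huggingface/tgi-gaudi:latest \
    --model-id meta-llama/Llama-2-7b-hf \
    --max-total-tokens 2048 \
    --max-batch-total-tokens 16384
```

Since the decode batch is padded up to a multiple of `BATCH_BUCKET_SIZE`, choosing `bs` as a multiple of that bucket size avoids reserving token budget for padding that is never used.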

@@ -482,7 +482,7 @@ class CausalLM(Model):
world_size = int(os.getenv("WORLD_SIZE", "1"))
rank = int(os.getenv("RANK", "0"))
self.enable_hpu_graph = os.getenv("ENABLE_HPU_GRAPH", "true").lower() == "true"
-self.limit_hpu_graph = os.getenv("LIMIT_HPU_GRAPH", "true").lower() == "true"
+self.limit_hpu_graph = os.getenv("LIMIT_HPU_GRAPH", "false").lower() == "true"
if world_size > 1:
import habana_frameworks.torch.hpu as torch_hpu