mirror of https://github.com/huggingface/text-generation-inference.git synced 2025-09-09 11:24:53 +00:00

David Corvoysier 238fbd4d50

Neuron backend fix and patch version 3.3.4 (#3273 )

* fix(neuron): wrong assertion when batch_size==1

* chore: prepare 3.3.4

2025-06-19 10:52:41 +02:00


Using TGI with Intel GPUs

TGI-optimized models are supported on Intel Data Center GPU Max 1100 and Max 1550. The recommended usage is through Docker.

On a server powered by Intel GPUs, TGI can be launched with the following command:

model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --rm --privileged --cap-add=sys_nice \
    --device=/dev/dri \
    --ipc=host --shm-size 1g --net host -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.4-intel-xpu \
    --model-id $model --cuda-graphs 0
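Because the command passes `--device=/dev/dri` to the container, it can help to confirm that the host actually exposes a GPU render node before launching. The following is a minimal sketch, not part of TGI: `has_render_node` is a hypothetical helper name, and it assumes the Intel GPU driver exposes standard DRM render nodes named `renderD*` under `/dev/dri`.

```shell
# Hypothetical pre-flight helper (not part of TGI).
# Returns success if the given directory contains at least one DRM render node.
# $1: device directory (defaults to /dev/dri, the path passed to --device)
has_render_node() {
    dri_dir="${1:-/dev/dri}"
    for node in "$dri_dir"/renderD*; do
        # If the glob matched nothing, "$node" is the literal pattern and -e fails.
        if [ -e "$node" ]; then
            return 0
        fi
    done
    return 1
}
```

For example, running `has_render_node || echo "no render node found under /dev/dri"` before `docker run` gives an early hint that the GPU driver is missing or not loaded.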

Using TGI with Intel CPUs

Intel® Extension for PyTorch (IPEX) also provides further optimizations for Intel CPUs. IPEX provides optimized operations such as flash attention, paged attention, Add + LayerNorm, RoPE, and more.

On a server powered by Intel CPUs, TGI can be launched with the following command:

model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --rm --privileged --cap-add=sys_nice \
    --device=/dev/dri \
    --ipc=host --shm-size 1g --net host -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.4-intel-cpu \
    --model-id $model --cuda-graphs 0

The launched TGI server can then be queried from clients. For details, check out the Consuming TGI guide.
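As a rough illustration of querying the server, the sketch below posts a prompt to TGI's `/generate` endpoint with `curl`. The helper names `build_payload` and `query_tgi` and the `TGI_URL` variable are hypothetical, introduced only for this example. The address is an assumption: with `--net host` the server is reachable directly on the host, and port 80 is assumed here as the container's default, so adjust it if your setup differs.

```shell
# Assumption: the API is reachable on port 80 of the host.
# Override TGI_URL if you passed --port or use a different address.
TGI_URL="${TGI_URL:-http://127.0.0.1:80}"

# Build the JSON body for the /generate endpoint.
# $1: prompt (this simple sketch does not escape quotes or backslashes)
# $2: maximum number of new tokens to generate
build_payload() {
    printf '{"inputs": "%s", "parameters": {"max_new_tokens": %d}}' "$1" "$2"
}

# Send a generation request and print the raw JSON response.
# $1: prompt, $2: optional token limit (defaults to 20)
query_tgi() {
    curl --silent --show-error --fail "$TGI_URL/generate" \
        -X POST \
        -H 'Content-Type: application/json' \
        -d "$(build_payload "$1" "${2:-20}")"
}
```

Once the model has finished loading, `query_tgi "What is Deep Learning?" 20` should print a JSON object containing the generated text. For prompts with quotes or other special characters, build the payload with a JSON-aware tool instead of `printf`.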
