review: do not use latest tag

David Corvoysier 2025-02-18 13:50:25 +00:00
parent 9c998f9f7e
commit 00931438ea
2 changed files with 8 additions and 9 deletions


@@ -25,7 +25,6 @@ image:
 --ulimit nofile=100000:100000 \
 --build-arg VERSION=$(VERSION) \
 -t text-generation-inference:$(VERSION)-neuron ${root_dir}
-docker tag text-generation-inference:$(VERSION)-neuron text-generation-inference:latest-neuron
 install_server:
 make -C ${mkfile_dir}/server install VERSION:=${VERSION}
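With the `latest-neuron` alias removed, the image is only addressable by its explicit version tag. A minimal sketch of building and invoking a versioned image, assuming the `image` target and `VERSION` variable behave as shown in the hunk above (the version value is illustrative):

```
# Build the Neuron image; it is now tagged text-generation-inference:<VERSION>-neuron only
make image VERSION=3.1.0

# Reference the image by that explicit version tag (no latest-neuron alias exists anymore)
docker run --rm text-generation-inference:3.1.0-neuron --help
```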


@@ -31,7 +31,7 @@ deployment instructions in the model card:
 The service is launched simply by running the text-generation-inference container with two sets of parameters:
 ```
-docker run <system_parameters> ghcr.io/huggingface/text-generation-inference:latest-neuron <service_parameters>
+docker run <system_parameters> ghcr.io/huggingface/text-generation-inference:3.1.0-neuron <service_parameters>
 ```
 - system parameters are used to map ports, volumes and devices between the host and the service,
@@ -59,7 +59,7 @@ docker run -p 8080:80 \
 -v $(pwd)/data:/data \
 --privileged \
 -e HF_TOKEN=${HF_TOKEN} \
-ghcr.io/huggingface/text-generation-inference:latest-neuron \
+ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron \
 <service_parameters>
 ```
@@ -70,7 +70,7 @@ docker run -p 8080:80 \
 -v $(pwd)/data:/data \
 --device=/dev/neuron0 \
 -e HF_TOKEN=${HF_TOKEN} \
-ghcr.io/huggingface/text-generation-inference:latest-neuron \
+ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron \
 <service_parameters>
 ```
@@ -92,7 +92,7 @@ docker run -p 8080:80 \
 -e HF_TOKEN=${HF_TOKEN} \
 -e HF_AUTO_CAST_TYPE="fp16" \
 -e HF_NUM_CORES=2 \
-ghcr.io/huggingface/text-generation-inference:latest-neuron:latest \
+ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron \
 --model-id meta-llama/Meta-Llama-3-8B \
 --max-batch-size 1 \
 --max-input-length 3164 \
@@ -101,7 +101,7 @@ docker run -p 8080:80 \
 ### Using a model exported to a local path
-Alternatively, you can first [export the model to neuron format](https://huggingface.co/docs/optimum-neuron/main/en/guides/export_model#exporting-neuron-models-using-text-generation-inference:latest-neuron) locally.
+Alternatively, you can first [export the model to neuron format](https://huggingface.co/docs/optimum-neuron/main/en/guides/export_model#exporting-neuron-models-using-text-generation-inference) locally.
 You can then deploy the service inside the shared volume:
@@ -109,7 +109,7 @@ You can then deploy the service inside the shared volume:
 docker run -p 8080:80 \
 -v $(pwd)/data:/data \
 --privileged \
-ghcr.io/huggingface/text-generation-inference:latest-neuron:latest \
+ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron \
 --model-id /data/<neuron_model_path>
 ```
@@ -126,7 +126,7 @@ docker run -p 8080:80 \
 -v $(pwd)/data:/data \
 --privileged \
 -e HF_TOKEN=${HF_TOKEN} \
-ghcr.io/huggingface/text-generation-inference:latest-neuron:latest \
+ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron \
 --model-id <organization>/<neuron-model>
 ```
@@ -135,7 +135,7 @@ docker run -p 8080:80 \
 Use the following command to list the available service parameters:
 ```
-docker run ghcr.io/huggingface/text-generation-inference:latest-neuron --help
+docker run ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron --help
 ```
 The configuration of an inference endpoint is always a compromise between throughput and latency: serving more requests in parallel will allow a higher throughput, but it will increase the latency.
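As a rough illustration of that tradeoff (a sketch only; the parameter values are assumptions taken from the examples above, not tuned recommendations), a latency-oriented deployment keeps the batch size small, while a throughput-oriented one raises it, provided the Neuron model was exported with a matching static batch size:

```
# Latency-oriented: at most one request per batch (illustrative values)
docker run -p 8080:80 \
       -v $(pwd)/data:/data \
       --privileged \
       -e HF_TOKEN=${HF_TOKEN} \
       ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron \
       --model-id <organization>/<neuron-model> \
       --max-batch-size 1 \
       --max-input-length 3164

# Throughput-oriented: same command with a larger --max-batch-size,
# which serves more requests in parallel at the cost of per-request latency
# (the model must have been exported for that batch size).
```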