Mirror of https://github.com/huggingface/text-generation-inference.git, synced 2025-04-22 15:32:08 +00:00

commit 00931438ea (parent 9c998f9f7e)
review: do not use latest tag
@@ -25,7 +25,6 @@ image:
 	--ulimit nofile=100000:100000 \
 	--build-arg VERSION=$(VERSION) \
 	-t text-generation-inference:$(VERSION)-neuron ${root_dir}
-	docker tag text-generation-inference:$(VERSION)-neuron text-generation-inference:latest-neuron
 
 install_server:
 	make -C ${mkfile_dir}/server install VERSION:=${VERSION}
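With the `docker tag ... latest-neuron` step removed, the build only produces the explicitly versioned image. For reference, a minimal sketch of invoking the target and checking the result, assuming the `image` target shown in the hunk header and a `VERSION` variable passed on the make command line:

```
# Build the Neuron image with an explicit version tag; no latest-neuron alias is created anymore
make image VERSION=3.1.0

# List the resulting images; only the versioned tag should appear
docker images --filter=reference='text-generation-inference:*-neuron'
```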
@@ -31,7 +31,7 @@ deployment instructions in the model card:
 The service is launched simply by running the text-generation-inference container with two sets of parameters:
 
 ```
-docker run <system_parameters> ghcr.io/huggingface/text-generation-inference:latest-neuron <service_parameters>
+docker run <system_parameters> ghcr.io/huggingface/text-generation-inference:3.1.0-neuron <service_parameters>
 ```
 
 - system parameters are used to map ports, volumes and devices between the host and the service,
@@ -59,7 +59,7 @@ docker run -p 8080:80 \
   -v $(pwd)/data:/data \
   --privileged \
   -e HF_TOKEN=${HF_TOKEN} \
-  ghcr.io/huggingface/text-generation-inference:latest-neuron \
+  ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron \
   <service_parameters>
 ```
 
@@ -70,7 +70,7 @@ docker run -p 8080:80 \
   -v $(pwd)/data:/data \
   --device=/dev/neuron0 \
   -e HF_TOKEN=${HF_TOKEN} \
-  ghcr.io/huggingface/text-generation-inference:latest-neuron \
+  ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron \
   <service_parameters>
 ```
 
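The two docker run variants above differ only in how Neuron devices reach the container: `--privileged` exposes every host device, while `--device` grants access to specific ones. A minimal sketch for a host with several devices under `/dev/neuron*`, reusing the same placeholders as the diff:

```
# Check which Neuron devices the host exposes
ls /dev/neuron*

# Pass two specific devices instead of running privileged
docker run -p 8080:80 \
  -v $(pwd)/data:/data \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  -e HF_TOKEN=${HF_TOKEN} \
  ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron \
  <service_parameters>
```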
@@ -92,7 +92,7 @@ docker run -p 8080:80 \
   -e HF_TOKEN=${HF_TOKEN} \
   -e HF_AUTO_CAST_TYPE="fp16" \
   -e HF_NUM_CORES=2 \
-  ghcr.io/huggingface/text-generation-inference:latest-neuron:latest \
+  ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron \
   --model-id meta-llama/Meta-Llama-3-8B \
   --max-batch-size 1 \
   --max-input-length 3164 \
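Once a container started with one of the commands above is ready, the endpoint can be exercised from the host. A quick check against the standard TGI `/generate` route, assuming the `-p 8080:80` port mapping used throughout this document:

```
# Send a single generation request to the running service
curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}}'
```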
@@ -101,7 +101,7 @@ docker run -p 8080:80 \
 
 ### Using a model exported to a local path
 
-Alternatively, you can first [export the model to neuron format](https://huggingface.co/docs/optimum-neuron/main/en/guides/export_model#exporting-neuron-models-using-text-generation-inference:latest-neuron) locally.
+Alternatively, you can first [export the model to neuron format](https://huggingface.co/docs/optimum-neuron/main/en/guides/export_model#exporting-neuron-models-using-text-generation-inference) locally.
 
 You can then deploy the service inside the shared volume:
 
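The linked guide covers the export in detail; as a rough sketch only, the local export can be done with the optimum-neuron CLI. The exact option names (`--batch_size`, `--sequence_length`, `--num_cores`, `--auto_cast_type`) and the output directory are assumptions to be checked against the optimum-neuron documentation:

```
# Export the model to Neuron format into the shared ./data volume (option names assumed, see the linked guide)
optimum-cli export neuron \
  --model meta-llama/Meta-Llama-3-8B \
  --batch_size 1 \
  --sequence_length 4096 \
  --num_cores 2 \
  --auto_cast_type fp16 \
  ./data/llama3-8b-neuron
```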
@@ -109,7 +109,7 @@ You can then deploy the service inside the shared volume:
 docker run -p 8080:80 \
   -v $(pwd)/data:/data \
   --privileged \
-  ghcr.io/huggingface/text-generation-inference:latest-neuron:latest \
+  ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron \
   --model-id /data/<neuron_model_path>
 ```
 
@@ -126,7 +126,7 @@ docker run -p 8080:80 \
   -v $(pwd)/data:/data \
   --privileged \
   -e HF_TOKEN=${HF_TOKEN} \
-  ghcr.io/huggingface/text-generation-inference:latest-neuron:latest \
+  ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron \
   --model-id <organization>/<neuron-model>
 ```
 
@@ -135,7 +135,7 @@ docker run -p 8080:80 \
 Use the following command to list the available service parameters:
 
 ```
-docker run ghcr.io/huggingface/text-generation-inference:latest-neuron --help
+docker run ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron --help
 ```
 
 The configuration of an inference endpoint is always a compromise between throughput and latency: serving more requests in parallel will allow a higher throughput, but it will increase the latency.
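As an illustration of that trade-off, the launcher options shown in this document can be tuned in either direction. The values below are placeholders rather than recommendations, and `--max-concurrent-requests` / `--max-total-tokens` are standard TGI launcher options assumed to behave the same in the Neuron image:

```
# Throughput-oriented sketch: allow more requests in flight, at the cost of per-request latency
docker run <system_parameters> ghcr.io/huggingface/text-generation-inference:<VERSION>-neuron \
  --model-id <organization>/<neuron-model> \
  --max-batch-size 4 \
  --max-concurrent-requests 64 \
  --max-input-length 3164 \
  --max-total-tokens 4096
```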