Merge branch 'main' into added_cli_docs
commit d90425eaa7
@@ -6,7 +6,7 @@ There are many ways you can consume Text Generation Inference server in your application.
 
 After the launch, you can query the model using either the `/generate` or `/generate_stream` routes:
 
-```shell
+```bash
 curl 127.0.0.1:8080/generate \
     -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
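The hunk above only exercises the `/generate` route. As a minimal sketch of the `/generate_stream` route mentioned in the same paragraph (not part of this commit, and assuming the same JSON payload with a server-sent-events response), a streaming query could look like:

```bash
# Sketch: same payload as /generate; the response arrives as a stream of
# "data: {...}" server-sent events rather than a single JSON body.
curl -N 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```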
@@ -20,14 +20,13 @@ curl 127.0.0.1:8080/generate \
 
 You can simply install `huggingface-hub` package with pip.
 
-```python
+```bash
 pip install huggingface-hub
 ```
 
 Once you start the TGI server, instantiate `InferenceClient()` with the URL to the endpoint serving the model. You can then call `text_generation()` to hit the endpoint through Python.
 
 ```python
 from huggingface_hub import InferenceClient
 
 client = InferenceClient(model=URL_TO_ENDPOINT_SERVING_TGI)
@@ -2,4 +2,23 @@
 
 If the model you wish to serve is behind gated access or the model repository on Hugging Face Hub is private, and you have access to the model, you can provide your Hugging Face Hub access token. You can generate and copy a read token from [Hugging Face Hub tokens page](https://huggingface.co/settings/tokens)
 
-If you're using the CLI, set the `HUGGING_FACE_HUB_TOKEN` environment variable.
+If you're using the CLI, set the `HUGGING_FACE_HUB_TOKEN` environment variable. For example:
+
+```
+export HUGGING_FACE_HUB_TOKEN=<YOUR READ TOKEN>
+```
+
+If you would like to do it through Docker, you can provide your token by specifying `HUGGING_FACE_HUB_TOKEN` as shown below.
+
+```bash
+model=meta-llama/Llama-2-7b-chat-hf
+volume=$PWD/data
+token=<your READ token>
+
+docker run --gpus all \
+    --shm-size 1g \
+    -e HUGGING_FACE_HUB_TOKEN=$token \
+    -p 8080:80 \
+    -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.0 \
+    --model-id $model
+```
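As a usage sketch for the CLI path above (illustrative only, not part of the commit; it assumes the launcher authenticates weight downloads with the `HUGGING_FACE_HUB_TOKEN` it finds in the environment), launching the same gated model could look like:

```bash
# Illustrative sketch: export a read token, then launch as usual; the
# download of gated weights should pick up HUGGING_FACE_HUB_TOKEN.
export HUGGING_FACE_HUB_TOKEN=<YOUR READ TOKEN>
text-generation-launcher --model-id meta-llama/Llama-2-7b-chat-hf --port 8080
```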
@@ -77,3 +77,9 @@ make run-falcon-7b-instruct
 ```
 
 This will serve Falcon 7B Instruct model from the port 8080, which we can query.
+
+To see all options to serve your models, check in the [codebase](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or the CLI:
+
+```bash
+text-generation-launcher --help
+```
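To complement `--help`, here is a hedged example of a typical invocation; the flag names are assumptions to verify against the output of `text-generation-launcher --help` for the installed version:

```bash
# Sketch of a launch using a few commonly listed options (verify flag names
# against --help; they may differ between versions).
text-generation-launcher \
    --model-id tiiuae/falcon-7b-instruct \
    --port 8080 \
    --max-input-length 1024 \
    --max-total-tokens 2048
```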