# Launching with Docker

The easiest way to get started is with the official Docker container:

```shell
model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.0 --model-id $model
```

**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.
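
Before pulling the larger text-generation-inference image, you can confirm that Docker can see your GPUs by running `nvidia-smi` inside a minimal CUDA container. A quick sanity check (the CUDA image tag here is illustrative; pick one matching your driver):

```shell
# If this prints your GPU table, the NVIDIA Container Toolkit is working.
# The CUDA image tag is illustrative; choose one matching your driver version.
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```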

You can then query the model using either the `/generate` or `/generate_stream` routes:

```shell
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
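
The `/generate` route returns a JSON object whose `generated_text` field holds the completion (the same field the Python client exposes below). With `jq` installed on the host, you can extract it directly; a minimal sketch:

```shell
# Assumes jq is available; -s silences curl's progress meter.
curl -s 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' | jq -r '.generated_text'
```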

```shell
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
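
`/generate_stream` returns tokens as server-sent events. If the output seems to arrive all at once rather than token by token, curl's output buffering is often the cause; its `-N` (`--no-buffer`) flag prints events as they come in:

```shell
# -N disables curl's output buffering so tokens print as they arrive.
curl -N 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```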

You can also query the model from Python. First, install the `text-generation` client library:

```shell
pip install text-generation
```

```python
from text_generation import Client

client = Client("http://127.0.0.1:8080")
print(client.generate("What is Deep Learning?", max_new_tokens=20).generated_text)

text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
    if not response.token.special:
        text += response.token.text
print(text)
```

To see all the options available to serve your models, check the [launcher code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or use the CLI:

```shell
text-generation-launcher --help
```
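
Launcher flags are appended to the `docker run` command after the image name, exactly like `--model-id` above. A sketch with two illustrative flags (names taken from the launcher's help output; verify them against `--help` for your version):

```shell
# Illustrative: cap prompt length and total sequence length at launch time.
# Verify flag names against text-generation-launcher --help for your version.
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:1.0.0 \
    --model-id $model --max-input-length 1024 --max-total-tokens 2048
```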