Update README

ehsanmok 2023-04-26 18:10:13 -07:00
parent 37194a5b9a
commit a0e5fc4189


@@ -21,22 +21,24 @@ to power LLMs api-inference widgets.
## Table of contents

-- [Features](#features)
-- [Optimized Architectures](#optimized-architectures)
-- [Get Started](#get-started)
-- [Docker](#docker)
-- [API Documentation](#api-documentation)
-- [A note on Shared Memory](#a-note-on-shared-memory-shm)
-- [Distributed Tracing](#distributed-tracing)
-- [Local Install](#local-install)
-- [CUDA Kernels](#cuda-kernels)
-- [Run BLOOM](#run-bloom)
-- [Download](#download)
-- [Run](#run)
-- [Quantization](#quantization)
-- [Develop](#develop)
-- [Testing](#testing)
+- [Text Generation Inference](#text-generation-inference)
+- [Table of contents](#table-of-contents)
+- [Features](#features)
+- [Optimized architectures](#optimized-architectures)
+- [Get started](#get-started)
+- [Docker](#docker)
+- [API documentation](#api-documentation)
+- [Distributed Tracing](#distributed-tracing)
+- [A note on Shared Memory (shm)](#a-note-on-shared-memory-shm)
+- [Local install](#local-install)
+- [CUDA Kernels](#cuda-kernels)
+- [Run BLOOM](#run-bloom)
+- [Download](#download)
+- [Run](#run)
+- [Quantization](#quantization)
+- [Develop](#develop)
+- [Testing](#testing)

## Features

- Serve the most popular Large Language Models with a simple launcher
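
The "simple launcher" in the feature line above refers to the `text-generation-launcher` binary built by this repository. A minimal, hedged sketch of what launching looks like; the model id, shard count, and port are illustrative choices, not values taken from this commit:

```shell
# Hedged sketch: serve an example model with the launcher.
# bigscience/bloom-560m, --num-shard 1 and --port 8080 are illustrative, not prescribed here.
text-generation-launcher --model-id bigscience/bloom-560m --num-shard 1 --port 8080
```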
@@ -131,7 +133,7 @@ by setting the address to an OTLP collector with the `--otlp-endpoint` argument.
### A note on Shared Memory (shm)

[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
`PyTorch` to do distributed training/inference. `text-generation-inference` make
use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.
@@ -152,14 +154,14 @@ creating a volume with:
and mounting it to `/dev/shm`.

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
this will impact performance.
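
As a hedged illustration of the two shm workarounds described above: `--shm-size` is a standard Docker flag and `NCCL_SHM_DISABLE` is an NCCL environment variable, but the image tag, model id, and port mapping below are illustrative assumptions, not values from this commit:

```shell
# Sketch 1: run the container with a 1 GB /dev/shm (image tag and model id are illustrative).
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id bigscience/bloom-560m

# Sketch 2: keep the default shm but disable NCCL's shared-memory transport (slower, per the note above).
docker run --gpus all -e NCCL_SHM_DISABLE=1 -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id bigscience/bloom-560m
```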
### Local install

You can also opt to install `text-generation-inference` locally.

First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using `conda`:

```shell
@@ -181,7 +183,7 @@ sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

On MacOS, using Homebrew:

```shell
brew install protobuf
@@ -241,6 +243,11 @@ make router-dev
## Testing

```shell
+# python
+make python-server-tests
+make python-client-tests
+# or both server and client tests
make python-tests
+# rust cargo tests
make integration-tests
```
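
If you want to bypass the `make` targets added above, a hedged sketch of direct invocations, assuming (as the repository layout suggests) that the Python tests are pytest-based and the integration tests are Rust cargo tests; the paths are illustrative:

```shell
# Assumption: these roughly correspond to what the Makefile targets wrap; paths are illustrative.
pytest -sv server/tests      # ~ make python-server-tests
pytest -sv clients/python    # ~ make python-client-tests
cargo test                   # ~ make integration-tests
```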