From a0e5fc4189417f97de66038f8674e3516bd7e49c Mon Sep 17 00:00:00 2001
From: ehsanmok <6980212+ehsanmok@users.noreply.github.com>
Date: Wed, 26 Apr 2023 18:10:13 -0700
Subject: [PATCH] Update README

---
 README.md | 49 ++++++++++++++++++++++++++++---------------------
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/README.md b/README.md
index 0c63f36b..54ecdbb2 100644
--- a/README.md
+++ b/README.md
@@ -21,22 +21,24 @@ to power LLMs api-inference widgets.
 
 ## Table of contents
 
-- [Features](#features)
-- [Optimized Architectures](#optimized-architectures)
-- [Get Started](#get-started)
-  - [Docker](#docker)
-  - [API Documentation](#api-documentation)
-  - [A note on Shared Memory](#a-note-on-shared-memory-shm)
-  - [Distributed Tracing](#distributed-tracing)
-  - [Local Install](#local-install)
-  - [CUDA Kernels](#cuda-kernels)
-- [Run BLOOM](#run-bloom)
-  - [Download](#download)
-  - [Run](#run)
-  - [Quantization](#quantization)
-- [Develop](#develop)
-- [Testing](#testing)
-
+- [Text Generation Inference](#text-generation-inference)
+  - [Table of contents](#table-of-contents)
+  - [Features](#features)
+  - [Optimized architectures](#optimized-architectures)
+  - [Get started](#get-started)
+    - [Docker](#docker)
+    - [API documentation](#api-documentation)
+    - [Distributed Tracing](#distributed-tracing)
+    - [A note on Shared Memory (shm)](#a-note-on-shared-memory-shm)
+    - [Local install](#local-install)
+    - [CUDA Kernels](#cuda-kernels)
+  - [Run BLOOM](#run-bloom)
+    - [Download](#download)
+    - [Run](#run)
+    - [Quantization](#quantization)
+  - [Develop](#develop)
+  - [Testing](#testing)
+
 ## Features
 
 - Serve the most popular Large Language Models with a simple launcher
@@ -131,7 +133,7 @@ by setting the address to an OTLP collector with the `--otlp-endpoint` argument.
 
 ### A note on Shared Memory (shm)
 
-[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by 
+[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
 `PyTorch` to do distributed training/inference. `text-generation-inference` make use of `NCCL` to enable Tensor
 Parallelism to dramatically speed up inference for large language models.
 
@@ -152,14 +154,14 @@ creating a volume with:
 
 and mounting it to `/dev/shm`.
 
-Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that 
+Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
 this will impact performance.
 
 ### Local install
 
-You can also opt to install `text-generation-inference` locally. 
+You can also opt to install `text-generation-inference` locally.
 
-First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least 
+First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
 Python 3.9, e.g. using `conda`:
 
 ```shell
@@ -181,7 +183,7 @@ sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
 rm -f $PROTOC_ZIP
 ```
 
-On MacOS, using Homebrew: 
+On MacOS, using Homebrew:
 
 ```shell
 brew install protobuf
@@ -241,6 +243,11 @@ make router-dev
 ## Testing
 
 ```shell
+# python
+make python-server-tests
+make python-client-tests
+# or both server and client tests
 make python-tests
+# rust cargo tests
 make integration-tests
 ```
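
The rebuilt table of contents in the first hunk leans on GitHub's automatic heading anchors (lowercase, punctuation dropped, spaces turned into hyphens), so an entry like `#a-note-on-shared-memory-shm` only resolves if the slug matches its heading exactly. A rough way to eyeball the slugs, sketched with standard `grep`/`sed`/`tr`; it will also pick up `#` comment lines inside code fences, so treat the output as a quick check rather than a validator:

```shell
# Print the slug GitHub would derive for each Markdown heading in README.md
grep -E '^#+ ' README.md \
  | sed -E 's/^#+ +//' \
  | tr '[:upper:]' '[:lower:]' \
  | tr -d '(),.`:' \
  | tr ' ' '-'
```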
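For the shared-memory hunks: both workarounds the README describes happen at container start. A minimal sketch of each, assuming the `ghcr.io/huggingface/text-generation-inference:latest` image tag and placeholder `$model` / `$volume` variables like those used elsewhere in the README:

```shell
# Option 1: give the container 1G of shared memory so NCCL can use /dev/shm
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest --model-id $model

# Option 2: disable NCCL's shared-memory transport instead; works without
# extra shm, but as the README notes this will impact performance
docker run --gpus all -e NCCL_SHM_DISABLE=1 -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest --model-id $model
```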