From a0e5fc4189417f97de66038f8674e3516bd7e49c Mon Sep 17 00:00:00 2001
From: ehsanmok <6980212+ehsanmok@users.noreply.github.com>
Date: Wed, 26 Apr 2023 18:10:13 -0700
Subject: [PATCH] Update README

---
 README.md | 49 ++++++++++++++++++++++++++++---------------------
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/README.md b/README.md
index 0c63f36b..54ecdbb2 100644
--- a/README.md
+++ b/README.md
@@ -21,22 +21,24 @@ to power LLMs api-inference widgets.
 
 ## Table of contents
 
-- [Features](#features)
-- [Optimized Architectures](#optimized-architectures)
-- [Get Started](#get-started)
-  - [Docker](#docker)
-  - [API Documentation](#api-documentation)
-  - [A note on Shared Memory](#a-note-on-shared-memory-shm)
-  - [Distributed Tracing](#distributed-tracing)
-  - [Local Install](#local-install)
-  - [CUDA Kernels](#cuda-kernels)
-- [Run BLOOM](#run-bloom)
-  - [Download](#download)
-  - [Run](#run)
-  - [Quantization](#quantization)
-- [Develop](#develop)
-- [Testing](#testing)
-
+- [Text Generation Inference](#text-generation-inference)
+  - [Table of contents](#table-of-contents)
+  - [Features](#features)
+  - [Optimized architectures](#optimized-architectures)
+  - [Get started](#get-started)
+    - [Docker](#docker)
+    - [API documentation](#api-documentation)
+    - [Distributed Tracing](#distributed-tracing)
+    - [A note on Shared Memory (shm)](#a-note-on-shared-memory-shm)
+    - [Local install](#local-install)
+    - [CUDA Kernels](#cuda-kernels)
+  - [Run BLOOM](#run-bloom)
+    - [Download](#download)
+    - [Run](#run)
+    - [Quantization](#quantization)
+  - [Develop](#develop)
+  - [Testing](#testing)
+
 ## Features
 
 - Serve the most popular Large Language Models with a simple launcher
@@ -131,7 +133,7 @@ by setting the address to an OTLP collector with the `--otlp-endpoint` argument.
 
 ### A note on Shared Memory (shm)
 
-[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by 
+[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
 `PyTorch` to do distributed training/inference. `text-generation-inference` make use of `NCCL` to enable Tensor
 Parallelism to dramatically speed up inference for large language models.
 
@@ -152,14 +154,14 @@ creating a volume with:
 
 and mounting it to `/dev/shm`.
 
-Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that 
+Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
 this will impact performance.
 
 ### Local install
 
-You can also opt to install `text-generation-inference` locally. 
+You can also opt to install `text-generation-inference` locally.
 
-First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least 
+First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
 Python 3.9, e.g. using `conda`:
 
 ```shell
@@ -181,7 +183,7 @@ sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
 rm -f $PROTOC_ZIP
 ```
 
-On MacOS, using Homebrew: 
+On MacOS, using Homebrew:
 
 ```shell
 brew install protobuf
@@ -241,6 +243,11 @@ make router-dev
 ## Testing
 
 ```shell
+# python
+make python-server-tests
+make python-client-tests
+# or both server and client tests
 make python-tests
+# rust cargo tests
 make integration-tests
 ```
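
The rebuilt table of contents in the first hunk leans on GitHub's automatic heading anchors (lowercase, punctuation dropped, spaces turned into hyphens), so an entry like `#a-note-on-shared-memory-shm` only resolves if the slug matches its heading exactly. A rough way to eyeball the slugs, sketched with standard `grep`/`sed`/`tr`; it will also pick up `#` comment lines inside code fences, so treat the output as a quick check rather than a validator:

```shell
# Print the slug GitHub would derive for each Markdown heading in README.md
grep -E '^#+ ' README.md \
  | sed -E 's/^#+ +//' \
  | tr '[:upper:]' '[:lower:]' \
  | tr -d '(),.`:' \
  | tr ' ' '-'
```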
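For the shared-memory hunks: both workarounds the README describes happen at container start. A minimal sketch of each, assuming the `ghcr.io/huggingface/text-generation-inference:latest` image tag and placeholder `$model` / `$volume` variables like those used elsewhere in the README:

```shell
# Option 1: give the container 1G of shared memory so NCCL can use /dev/shm
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest --model-id $model

# Option 2: disable NCCL's shared-memory transport instead; works without
# extra shm, but as the README notes this will impact performance
docker run --gpus all -e NCCL_SHM_DISABLE=1 -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest --model-id $model
```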