From 27baaeffe0c5f588b5f95b0274c0685fbebf5d78 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 22 Aug 2023 14:50:16 +0300
Subject: [PATCH] Update tensor_parallelism.md

---
 docs/source/conceptual/tensor_parallelism.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/docs/source/conceptual/tensor_parallelism.md b/docs/source/conceptual/tensor_parallelism.md
index 5075f3f7..7cf64a59 100644
--- a/docs/source/conceptual/tensor_parallelism.md
+++ b/docs/source/conceptual/tensor_parallelism.md
@@ -1,13 +1,15 @@
 # Tensor Parallelism
 
-Tensor Paralellism (also called horizontal model paralellism) is a technique used to fit a large model in multiple GPUs. Model parallelism enables large model training and inference by putting different layers in different GPUs (also called ranks). Intermediate outputs between ranks are sent and received from one rank to another in a synchronous or asynchronous manner. When multiplying input with weights for inference, multiplying input with weights directly is equivalent to dividing weight matrix column-wise, multiplying each column with input separately, and then concatenating the separate outputs like below 👇
+Tensor parallelism (also called horizontal model parallelism) is a technique used to fit a large model across multiple GPUs. Intermediate outputs are sent and received from one rank (GPU) to another in a synchronous or asynchronous manner. When multiplying the input with the weights for inference, multiplying the input with the full weight matrix is equivalent to dividing the weight matrix column-wise, multiplying each column with the input separately, and then concatenating the separate outputs like below 👇
 
 ![Image courtesy of Anton Lozkhov](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/TP.png)
 
-In TGI, tensor parallelism is implemented under the hood by sharding weights and placing them in different ranks. The matrix multiplications then take place in different ranks and are then gathered into single tensor.
+In TGI, tensor parallelism is implemented under the hood by sharding the weights and placing the shards on different ranks. The matrix multiplications then take place on the different ranks, and the results are gathered into a single tensor.
 
-Tensor Parallelism only works for model officially supported, it will not work when falling back on `transformers`.
+Tensor parallelism only works for officially supported models; it will not work when falling back on `transformers`.
+
+You can learn more about tensor parallelism in the [transformers documentation](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_many#tensor-parallelism).
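
The column-wise equivalence described in the updated paragraph can be checked with a few lines of NumPy. This is only an illustrative sketch: the array shapes, the `num_ranks` value, and the use of NumPy are assumptions for demonstration, not TGI's actual implementation.

```python
# Illustrative sketch (not TGI's implementation): splitting a weight matrix
# column-wise across "ranks" gives the same result as the full matmul.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))      # input activations
w = rng.standard_normal((512, 1024))   # full weight matrix

# Each rank holds a slice of the weight columns (num_ranks is arbitrary here).
num_ranks = 4
shards = np.split(w, num_ranks, axis=1)

# Every rank multiplies the same input by its own column shard...
partial_outputs = [x @ shard for shard in shards]

# ...and the partial outputs are concatenated back into a single tensor.
y_parallel = np.concatenate(partial_outputs, axis=1)

# The sharded computation matches the unsharded one up to floating-point error.
assert np.allclose(x @ w, y_parallel)
```

In TGI the shards live on different ranks, so the final gather is a cross-GPU communication step rather than an in-process concatenation.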