mirror of
https://github.com/huggingface/text-generation-inference.git
synced 2025-09-10 20:04:52 +00:00
Update docs/source/conceptual/tensor_parallelism.md
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
parent 27baaeffe0
commit 1e828f33c0
@@ -1,6 +1,6 @@
 # Tensor Parallelism
 
-Tensor parallelism (also called horizontal model parallelism) is a technique used to fit a large model in multiple GPUs. Intermediate outputs between ranks are sent and received from one rank to another in a synchronous or asynchronous manner. When multiplying input with weights for inference, multiplying input with weights directly is equivalent to dividing the weight matrix column-wise, multiplying each column with input separately, and then concatenating the separate outputs like below 👇
+Tensor parallelism is a technique used to fit a large model in multiple GPUs. Intermediate outputs between GPUs are sent and received from one GPU to another in a synchronous or asynchronous manner. For example, when multiplying the input tensors with the first weights tensor, multiplying both tensors is equivalent to splitting the weight tensor column-wise, multiplying each column with input separately, and then concatenating the separate outputs like below 👇
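The equivalence described in the updated paragraph (splitting the weight tensor column-wise, multiplying the input by each part, then concatenating) can be checked with a minimal sketch. It assumes plain Python lists on a single CPU, so no GPUs or communication are involved; `matmul`, `split_columns` and `concat_columns` are hypothetical helpers written for illustration, not text-generation-inference APIs. In a real multi-GPU setup the concatenation step is where the per-GPU outputs would be exchanged (an all-gather-style collective is assumed here).

```python
# Minimal sketch (pure Python, no GPUs): multiplying an input by the full weight
# matrix gives the same result as splitting the weights column-wise into shards,
# multiplying the input by each shard separately, and concatenating the outputs.

def matmul(x, w):
    """Multiply an (m x k) matrix x by a (k x n) matrix w, both given as lists of rows."""
    k, n = len(w), len(w[0])
    return [[sum(row[i] * w[i][j] for i in range(k)) for j in range(n)] for row in x]


def split_columns(w, num_shards):
    """Hypothetical helper: split w column-wise into num_shards equal, contiguous shards."""
    n = len(w[0])
    assert n % num_shards == 0, "columns must divide evenly across shards"
    size = n // num_shards
    return [[row[s * size:(s + 1) * size] for row in w] for s in range(num_shards)]


def concat_columns(parts):
    """Hypothetical helper: place per-shard outputs side by side, row by row
    (stands in for gathering the outputs from every GPU)."""
    return [[v for part in parts for v in part[r]] for r in range(len(parts[0]))]


x = [[1, 2, 3],
     [4, 5, 6]]          # input: 2 x 3
w = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 1, 0, 1]]       # weights: 3 x 4

full = matmul(x, w)                          # single-device result
shards = split_columns(w, num_shards=2)      # each "GPU" would hold a 3 x 2 shard
partial = [matmul(x, shard) for shard in shards]
combined = concat_columns(partial)

print(combined)          # [[7, 5, 4, 10], [16, 11, 13, 25]]
print(full == combined)  # True
```

Each shard only needs its own slice of the weights, which is what lets a model that is too large for one GPU be spread across several; the price is the communication needed to reassemble the outputs.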