StarAI Technical Whitepaper
Large Model Training and Inference on Distributed Low-Memory GPU Computing Power

Parallelization Techniques for LLM Inference

Model Parallelism

Model parallelism involves distributing the layers of an LLM across multiple GPUs. Consider an LLM with $L$ layers to be distributed over $G$ GPUs. Each GPU is assigned a subset of layers, with roughly $\frac{L}{G}$ layers per GPU. The challenge is to minimize communication overhead while maintaining computational balance. Let $C_i$ represent the computational complexity of layer $i$ and $M_i$ its memory requirement. The goal is to find an allocation $A = \{a_1, a_2, ..., a_G\}$, where $a_g$ is the set of layers assigned to GPU $g$, such that the total communication overhead $O_{comm}$ is minimized and the memory requirements $M_{req}$ are balanced:

$$A^* = \arg \min_{A} O_{comm}(A)$$

$$\text{s.t. } \sum_{i \in a_g} M_i \leq M_{max} \quad \text{and} \quad \sum_{i \in a_g} C_i \approx \frac{1}{G} \sum_{j=1}^{L} C_j$$

Here, $M_{max}$ is the maximum memory available per GPU, and the second constraint ensures that the computational load is evenly distributed across GPUs.
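
As a rough illustration of the allocation problem above, the following sketch greedily assigns contiguous layers to GPUs so that each GPU stays within its memory budget $M_{max}$ while approaching its fair share of the total compute. The function `allocate_layers`, its greedy strategy, and the toy cost figures are illustrative assumptions rather than StarAI's scheduler; an exact allocation would require solving the constrained optimization above.

```python
# Minimal sketch of balanced layer allocation (assumed greedy heuristic,
# not StarAI's production scheduler).

def allocate_layers(costs, mems, num_gpus, mem_max):
    """Greedily assign contiguous layers to GPUs.

    costs[i] -- computational cost C_i of layer i
    mems[i]  -- memory requirement M_i of layer i
    mem_max  -- per-GPU memory budget M_max
    Returns one list of layer indices per GPU.
    """
    target = sum(costs) / num_gpus                  # (1/G) * sum_j C_j
    allocation = [[] for _ in range(num_gpus)]
    gpu, acc_cost, acc_mem = 0, 0.0, 0.0

    for i, (c, m) in enumerate(zip(costs, mems)):
        over_mem = acc_mem + m > mem_max
        over_cost = acc_cost + c > target
        # Advance to the next GPU when the current one is full or has
        # reached its fair share of compute.
        if allocation[gpu] and (over_mem or over_cost) and gpu < num_gpus - 1:
            gpu, acc_cost, acc_mem = gpu + 1, 0.0, 0.0
        if acc_mem + m > mem_max:
            raise ValueError(f"layer {i} does not fit within M_max")
        allocation[gpu].append(i)
        acc_cost += c
        acc_mem += m
    return allocation


if __name__ == "__main__":
    # 8 layers with uniform cost/memory, split across 4 GPUs.
    print(allocate_layers([1.0] * 8, [2.0] * 8, num_gpus=4, mem_max=5.0))
    # -> [[0, 1], [2, 3], [4, 5], [6, 7]]
```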

Pipeline Parallelism

Pipeline parallelism processes multiple instances of the model concurrently. If $P$ instances are processed in parallel, each passing through $S$ stages, the throughput $T$ can be increased:

$$T = \frac{P \times S}{\text{Total time per instance}}$$

The total time per instance is limited by the stage with the maximum latency, $\text{Max}(s_1, s_2, ..., s_S)$. To maximize throughput, the system must pipeline stages efficiently and balance the load across them.
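
To make this relation concrete, the short sketch below estimates steady-state throughput from a set of stage latencies: once the pipeline is full, one instance completes roughly every $\text{Max}(s_1, ..., s_S)$ time units, so the slowest stage caps throughput. The `pipeline_throughput` helper and the latency values are illustrative assumptions, not measured StarAI figures.

```python
# Small sketch of pipeline throughput; stage latencies are illustrative
# placeholders, not profiled values.

def pipeline_throughput(stage_latencies, num_instances):
    """Estimate instances completed per unit time for a simple pipeline.

    In steady state the pipeline is bound by its slowest stage:
    a new instance finishes every Max(s_1, ..., s_S) time units.
    """
    bottleneck = max(stage_latencies)        # Max(s_1, ..., s_S)
    fill_time = sum(stage_latencies)         # time for the first instance
    total_time = fill_time + (num_instances - 1) * bottleneck
    return num_instances / total_time


if __name__ == "__main__":
    stages = [10e-3, 12e-3, 9e-3, 11e-3]     # S = 4 stage latencies (seconds)
    print(f"{pipeline_throughput(stages, num_instances=64):.1f} instances/s")
```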

Tensor Parallelism

Tensor parallelism partitions the input tensors across GPUs. Given a tensor $T$ of size $D \times N$ to be split across $G$ GPUs, each GPU receives a sub-tensor $T_g$ of size $\frac{D}{G} \times N$. The key is to choose an optimal splitting ratio $R = \frac{D}{G}$ that minimizes communication overhead while maximizing computational efficiency. Assuming $T$ represents input data for an LLM, the sub-tensor $T_g$ can be computed as:

$$T_g = T_{((g-1) \times R + 1) : (g \times R),\, :}$$

where $R$ must be chosen such that the parallel computation of $T_g$ across GPUs minimizes the overall execution time $E$, which includes both computation and communication costs:

$$E = \sum_{g=1}^{G} e_g + c(R, G)$$

Here, $e_g$ is the computation time for sub-tensor $T_g$ on GPU $g$, and $c(R, G)$ is the communication overhead, which depends on the split ratio $R$ and the number of GPUs $G$.
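
The row-wise split of $T$ into sub-tensors $T_g$ can be sketched as follows, using NumPy arrays as stand-ins for on-GPU tensors. The `split_tensor` helper and the assumption that $D$ divides evenly by $G$ are simplifications for illustration; a production system would also weigh the communication term $c(R, G)$ when choosing $R$.

```python
# Minimal sketch of the row-wise tensor split T_g described above.
import numpy as np

def split_tensor(t, num_gpus):
    """Split a D x N tensor into G sub-tensors of shape (D/G) x N."""
    d, _ = t.shape
    assert d % num_gpus == 0, "D must be divisible by G for an even split"
    r = d // num_gpus                        # split ratio R = D / G
    # 0-indexed equivalent of T_g = T_{((g-1)*R + 1) : (g*R), :}
    return [t[g * r:(g + 1) * r, :] for g in range(num_gpus)]


if __name__ == "__main__":
    T = np.arange(8 * 4).reshape(8, 4)       # D = 8, N = 4
    shards = split_tensor(T, num_gpus=4)
    for g, tg in enumerate(shards):
        print(f"GPU {g}: shape {tg.shape}")  # each shard is 2 x 4
```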

These parallelization techniques, when combined with advanced memory management strategies, can significantly enhance the inference capabilities of LLMs on mid-range GPUs. The mathematical formulations provided offer a glimpse into the complexity of optimizing these systems, taking into account both computational and communication costs to achieve the best performance.
