Parallelization Techniques for LLM Inference

Model Parallelism

Model parallelism distributes the layers of an LLM across multiple GPUs. Consider an LLM with $L$ layers to be distributed over $G$ GPUs. Each GPU is assigned a subset of layers, roughly $\frac{L}{G}$ layers per GPU. The challenge is to minimize communication overhead while maintaining computational balance.

Let $C_i$ denote the computational cost of layer $i$ and $M_i$ its memory requirement. The goal is to find an allocation $A = \{a_1, a_2, ..., a_G\}$, where $a_g$ is the set of layers assigned to GPU $g$, that minimizes the total communication overhead $O_{comm}$ while keeping each GPU within its memory budget and balancing the computational load:

$$
A^* = \arg\min_{A} O_{comm}(A)
$$

$$
\text{s.t. } \sum_{i \in a_g} M_i \leq M_{max} \quad \text{and} \quad \sum_{i \in a_g} C_i \approx \frac{1}{G} \sum_{j=1}^{L} C_j \quad \text{for each GPU } g
$$

Here, $M_{max}$ is the maximum memory available per GPU, and the second constraint ensures that the computational load is evenly distributed.
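A minimal greedy sketch of this allocation in Python, assuming hypothetical per-layer cost lists `layer_compute` and `layer_memory`; a real scheduler would also model the communication term $O_{comm}$ explicitly rather than fixing layers in order:

```python
# Minimal sketch: greedy contiguous layer-to-GPU allocation.
# Assumes layers stay in order (only adjacent layers communicate), so the
# communication overhead is fixed at G-1 boundary transfers and the problem
# reduces to balancing compute under a per-GPU memory cap.

def allocate_layers(layer_compute, layer_memory, num_gpus, mem_max):
    target = sum(layer_compute) / num_gpus     # ideal compute per GPU
    allocation = [[] for _ in range(num_gpus)]
    used_mem = [0.0] * num_gpus
    used_compute = [0.0] * num_gpus

    gpu = 0
    for layer, (c, m) in enumerate(zip(layer_compute, layer_memory)):
        # Move on to the next GPU once this one is full by compute or memory
        # (the last GPU absorbs any remaining layers).
        over_budget = (used_compute[gpu] + c > target or
                       used_mem[gpu] + m > mem_max)
        if over_budget and gpu < num_gpus - 1 and allocation[gpu]:
            gpu += 1
        if used_mem[gpu] + m > mem_max:
            raise ValueError(f"Layer {layer} does not fit on GPU {gpu}")
        allocation[gpu].append(layer)
        used_mem[gpu] += m
        used_compute[gpu] += c
    return allocation


# Example: 8 layers over 2 GPUs with a 10 GB cap (illustrative numbers).
if __name__ == "__main__":
    compute = [1.0, 1.2, 1.1, 0.9, 1.0, 1.3, 1.1, 1.0]
    memory = [2.0, 2.2, 2.1, 1.9, 2.0, 2.3, 2.1, 2.0]
    print(allocate_layers(compute, memory, num_gpus=2, mem_max=10.0))
```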

Pipeline Parallelism

Pipeline parallelism splits inference into $S$ sequential stages and keeps multiple instances of the model in flight concurrently. If $P$ instances are processed in parallel, each passing through the $S$ stages, the throughput $T$ can be increased:

$$
T = \frac{P \times S}{\text{Total time per instance}}
$$

The total time per instance is dominated by the stage with the maximum latency, $\max(s_1, s_2, ..., s_S)$. To maximize throughput, the system must pipeline stages efficiently and balance the load across them, since the slowest stage sets the pace of the entire pipeline.
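As a rough illustration of this bottleneck effect, here is a small Python sketch assuming hypothetical per-stage latencies and a simple steady-state model in which, after an initial fill, the pipeline completes one instance per bottleneck interval:

```python
# Minimal sketch: estimate pipeline throughput from per-stage latencies.
# In steady state the pipeline finishes roughly one instance every
# max(stage latencies) seconds, so balancing stages directly raises throughput.

def pipeline_throughput(stage_latency, num_instances):
    bottleneck = max(stage_latency)        # slowest stage sets the pace
    fill_time = sum(stage_latency)         # time for the first instance
    # Fill the pipeline once, then one completion per bottleneck interval.
    total_time = fill_time + (num_instances - 1) * bottleneck
    return num_instances / total_time      # instances per second


if __name__ == "__main__":
    balanced   = [0.25, 0.25, 0.25, 0.25]   # well-balanced 4-stage pipeline
    unbalanced = [0.10, 0.10, 0.70, 0.10]   # one slow stage dominates
    print(pipeline_throughput(balanced, num_instances=64))    # ~3.8 instances/s
    print(pipeline_throughput(unbalanced, num_instances=64))  # ~1.4 instances/s
```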

Tensor Parallelism

Tensor parallelism partitions the input tensors across GPUs. Given a tensor $T$ of size $D \times N$ to be split across $G$ GPUs, each GPU receives a sub-tensor $T_g$ of size $\frac{D}{G} \times N$. The key is to choose an optimal splitting ratio $R = \frac{D}{G}$ (the number of rows per GPU) that minimizes communication overhead while maximizing computational efficiency. Assuming $T$ holds the input data for an LLM, the sub-tensor $T_g$ assigned to GPU $g$ is the corresponding row slice:

$$
T_g = T_{((g-1) \times R + 1) : (g \times R),\,:}
$$

Here, $R$ must be chosen such that the parallel computation of the $T_g$ across GPUs minimizes the overall execution time $E$, which includes both computation and communication costs:

$$
E = \sum_{g=1}^{G} e_g + c(R, G)
$$

Here, $e_g$ is the computation time for sub-tensor $T_g$ on GPU $g$, and $c(R, G)$ is the communication overhead, which depends on the split ratio $R$ and the number of GPUs $G$.
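The row slice above maps directly onto a standard array split. Below is a minimal NumPy sketch; the per-row compute cost and per-GPU communication cost are hypothetical placeholders used only to evaluate $E$ under the formula above:

```python
import numpy as np

# Minimal sketch: row-wise split of a D x N tensor across G GPUs, plus an
# evaluation of E = sum_g e_g + c(R, G) under a toy cost model.

D, N, G = 4096, 1024, 4
T = np.random.rand(D, N)
R = D // G                                   # rows per GPU (assumes G divides D)

# 0-indexed equivalent of the slice T_{((g-1)R+1):(gR), :} from the formula above.
shards = [T[g * R:(g + 1) * R, :] for g in range(G)]
assert all(shard.shape == (R, N) for shard in shards)

# Hypothetical cost model: e_g proportional to the rows held by GPU g,
# c(R, G) growing with the number of GPUs exchanging activations.
compute_cost_per_row = 1e-4
comm_cost = 5e-3 * G
e = [shard.shape[0] * compute_cost_per_row for shard in shards]
E = sum(e) + comm_cost
print(f"shard shape: {shards[0].shape}, estimated E: {E:.3f} s")
```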

These parallelization techniques, when combined with advanced memory management strategies, can significantly enhance the inference capabilities of LLMs on mid-range GPUs. The mathematical formulations provided offer a glimpse into the complexity of optimizing these systems, taking into account both computational and communication costs to achieve the best performance.
