Parallelization Techniques for LLM Inference
Model parallelism involves distributing the layers of an LLM across multiple GPUs. Consider an LLM with $L$ layers to be distributed over $G$ GPUs. Each GPU is assigned a subset of layers, roughly $\lceil L/G \rceil$ layers per GPU. The challenge is to minimize the communication overhead while maintaining computational balance.

Let $C_i$ represent the computational cost of layer $i$ and $M_i$ its memory requirement. The goal is to find an allocation $\mathcal{A} = \{A_1, \dots, A_G\}$, where $A_g$ is the set of layers assigned to GPU $g$, such that the total communication overhead is minimized and the memory requirements are balanced:

$$
\min_{\mathcal{A}} \; \mathrm{Comm}(\mathcal{A})
\quad \text{s.t.} \quad
\sum_{i \in A_g} M_i \le M_{\max} \;\; \forall g,
\qquad
\sum_{i \in A_g} C_i \approx \frac{1}{G} \sum_{i=1}^{L} C_i \;\; \forall g
$$

Here, $M_{\max}$ is the maximum memory available per GPU, and the second constraint ensures that the computational load is evenly distributed across GPUs.
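As a rough sketch of this allocation problem, the greedy pass below assigns contiguous blocks of layers to GPUs, filling each one up to an even share of the total compute while respecting the memory cap $M_{\max}$. The function name and the per-layer cost and memory estimates are hypothetical; a real deployment would derive them from profiling and may use a more sophisticated partitioning algorithm.

```python
# Greedy layer-to-GPU allocation sketch (illustrative, not a definitive method).
# Per-layer compute and memory costs are assumed to be known estimates.

def allocate_layers(compute_costs, memory_costs, num_gpus, mem_max):
    """Assign contiguous blocks of layers to GPUs, balancing compute under a memory cap."""
    target = sum(compute_costs) / num_gpus      # even share of compute per GPU
    allocation = [[] for _ in range(num_gpus)]
    gpu, compute_used, mem_used = 0, 0.0, 0.0

    for layer, (c_i, m_i) in enumerate(zip(compute_costs, memory_costs)):
        # Advance to the next GPU when this one is full (memory) or has its share (compute).
        needs_next = (mem_used + m_i > mem_max) or (compute_used + c_i > target)
        if needs_next and gpu < num_gpus - 1:
            gpu += 1
            compute_used, mem_used = 0.0, 0.0
        if mem_used + m_i > mem_max:
            raise ValueError(f"layer {layer} does not fit on GPU {gpu}")
        allocation[gpu].append(layer)
        compute_used += c_i
        mem_used += m_i
    return allocation

# Hypothetical example: 8 layers, 2 GPUs, 16 GB usable per GPU.
costs  = [1.0, 1.2, 1.1, 0.9, 1.0, 1.3, 1.1, 1.0]   # relative compute per layer (assumed)
memory = [2.0, 2.5, 2.2, 2.0, 2.1, 2.6, 2.3, 2.1]   # GB per layer (assumed)
print(allocate_layers(costs, memory, num_gpus=2, mem_max=16.0))
```

Keeping each GPU's block contiguous means every GPU boundary costs only one activation transfer per token, which is essentially the communication term the formulation above tries to minimize.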
Pipeline parallelism processes multiple instances of the model concurrently by splitting inference into stages. If $P$ instances are processed in parallel, each passing through $S$ stages, the throughput can be increased:

$$
\text{Throughput} \approx \frac{P}{(S + P - 1)\,\max_{s} T_s}
$$

The total time per instance is dominated by the stage with the maximum latency $\max_s T_s$: the pipeline fills for $S - 1$ ticks and then completes roughly one instance per tick in steady state. To maximize throughput, the system must keep the pipeline full and balance the load across stages.
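A back-of-the-envelope model of this behaviour, assuming each pipeline tick is gated by the slowest stage and ignoring communication bubbles, might look like the following (the stage latencies and request count are made up for illustration):

```python
# Simple pipeline throughput estimate: (S + P - 1) ticks, each gated by the
# slowest stage. All numbers below are illustrative assumptions.

def pipeline_throughput(stage_latencies, num_instances):
    """Estimate throughput (instances/sec) for P instances over S pipeline stages."""
    s = len(stage_latencies)
    t_max = max(stage_latencies)                  # slowest stage dominates each tick
    total_time = (s + num_instances - 1) * t_max  # fill + steady state + drain
    return num_instances / total_time

# Hypothetical 4-stage pipeline with per-stage latencies in seconds, 16 requests in flight.
stages = [0.012, 0.015, 0.011, 0.014]
print(f"{pipeline_throughput(stages, num_instances=16):.1f} instances/sec")
```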
Tensor parallelism partitions the input tensors across GPUs. Given a tensor of size $N$ to be split across $G$ GPUs, each GPU receives a sub-tensor of size $N/G$. The key is to choose an optimal splitting ratio $r$ that minimizes the communication overhead while maximizing computational efficiency. Assuming $X$ is a tensor representing input data for an LLM, the split tensor $X_g$ assigned to GPU $g$ can be computed as:

$$
X_g = X\!\left[(g-1)\tfrac{N}{G} : g\tfrac{N}{G}\right], \qquad g = 1, \dots, G
$$
Here $r$ must be chosen such that the parallel computation over the sub-tensors $X_g$ across GPUs minimizes the overall execution time $T_{\text{total}}$, which includes both computation and communication costs:

$$
T_{\text{total}} = \max_{g} T_{\text{comp}}(X_g) + T_{\text{comm}}(r, G)
$$

Here, $T_{\text{comp}}(X_g)$ is the computation time for tensor $X_g$ on GPU $g$, and $T_{\text{comm}}(r, G)$ is the communication overhead, which depends on the split ratio $r$ and the number of GPUs $G$.
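A toy cost model along these lines can be used to compare candidate splits by their estimated $T_{\text{total}}$. The compute rate and link bandwidth constants below are assumptions for illustration, not measured values for any particular GPU or interconnect:

```python
# Toy cost model for the T_total formulation above (all rates are assumed).
COMPUTE_RATE = 2.0e11    # elements/sec each GPU can process (assumed)
LINK_BANDWIDTH = 5.0e10  # elements/sec across the interconnect (assumed)

def estimated_total_time(split_sizes, num_gpus):
    """T_total = max_g T_comp(X_g) + T_comm(r, G) for one split of the tensor."""
    n_elements = sum(split_sizes)
    t_comp = max(split_sizes) / COMPUTE_RATE   # slowest shard gates the compute phase
    # Ring all-reduce style cost: each GPU exchanges roughly (G-1)/G of the tensor.
    t_comm = (num_gpus - 1) / num_gpus * n_elements / LINK_BANDWIDTH
    return t_comp + t_comm

# Hypothetical activation tensor: batch 8 x sequence 2048 x hidden 8192, on 4 GPUs.
n, g = 8 * 2048 * 8192, 4
even_split = [n / g] * g
skewed_split = [0.4 * n, 0.2 * n, 0.2 * n, 0.2 * n]   # deliberately unbalanced, for comparison
print(f"even:   {estimated_total_time(even_split, g) * 1e3:.2f} ms")
print(f"skewed: {estimated_total_time(skewed_split, g) * 1e3:.2f} ms")
```

In this simple model the even split wins because it minimizes the slowest shard's compute time while the communication term stays fixed; with heterogeneous GPUs or overlap of communication and computation, the optimal ratio can differ.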
These parallelization techniques, when combined with advanced memory management strategies, can significantly enhance the inference capabilities of LLMs on mid-range GPUs. The mathematical formulations above offer a glimpse into the complexity of optimizing these systems, taking into account both computational and communication costs to achieve the best performance.