StarAI Technical Whitepaper
Large Model Training and Inference on Distributed Low-Memory GPU Computing Power

Parallelization Techniques for LLM Inference

Model Parallelism

Model parallelism involves distributing the layers of an LLM across multiple GPUs. Consider an LLM with $L$ layers to be distributed over $G$ GPUs. Each GPU is assigned a subset of layers, with roughly $\frac{L}{G}$ layers per GPU. The challenge is to minimize communication overhead while maintaining computational balance. Let $C_i$ represent the computational complexity of layer $i$ and $M_i$ its memory requirement. The goal is to find an allocation $A = \{a_1, a_2, ..., a_G\}$, where $a_g$ is the set of layers assigned to GPU $g$, such that the total communication overhead $O_{comm}$ is minimized and the memory requirements $M_{req}$ are balanced:

$$A^* = \arg \min_{A} O_{comm}(A)$$

$$\text{s.t. } \sum_{i \in a_g} M_i \leq M_{max} \quad \text{and} \quad \sum_{i \in a_g} C_i \approx \frac{1}{G} \sum_{j=1}^{L} C_j$$

Here, $M_{max}$ is the maximum memory available per GPU, and the second constraint ensures that the computational load is evenly distributed across GPUs.
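
As a rough illustration of the allocation problem above, the following sketch greedily assigns contiguous layers to GPUs so that each GPU stays within its memory budget $M_{max}$ while approaching its fair share of the total compute. The function `allocate_layers`, its greedy strategy, and the toy cost figures are illustrative assumptions rather than StarAI's scheduler; an exact allocation would require solving the constrained optimization above.

```python
# Minimal sketch of balanced layer allocation (assumed greedy heuristic,
# not StarAI's production scheduler).

def allocate_layers(costs, mems, num_gpus, mem_max):
    """Greedily assign contiguous layers to GPUs.

    costs[i] -- computational cost C_i of layer i
    mems[i]  -- memory requirement M_i of layer i
    mem_max  -- per-GPU memory budget M_max
    Returns one list of layer indices per GPU.
    """
    target = sum(costs) / num_gpus                  # (1/G) * sum_j C_j
    allocation = [[] for _ in range(num_gpus)]
    gpu, acc_cost, acc_mem = 0, 0.0, 0.0

    for i, (c, m) in enumerate(zip(costs, mems)):
        over_mem = acc_mem + m > mem_max
        over_cost = acc_cost + c > target
        # Advance to the next GPU when the current one is full or has
        # reached its fair share of compute.
        if allocation[gpu] and (over_mem or over_cost) and gpu < num_gpus - 1:
            gpu, acc_cost, acc_mem = gpu + 1, 0.0, 0.0
        if acc_mem + m > mem_max:
            raise ValueError(f"layer {i} does not fit within M_max")
        allocation[gpu].append(i)
        acc_cost += c
        acc_mem += m
    return allocation


if __name__ == "__main__":
    # 8 layers with uniform cost/memory, split across 4 GPUs.
    print(allocate_layers([1.0] * 8, [2.0] * 8, num_gpus=4, mem_max=5.0))
    # -> [[0, 1], [2, 3], [4, 5], [6, 7]]
```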

Pipeline Parallelism

Pipeline parallelism processes multiple instances of the model concurrently. If $P$ instances are processed in parallel, each passing through $S$ stages, the throughput $T$ can be increased:

$$T = \frac{P \times S}{\text{Total time per instance}}$$

The total time per instance is limited by the stage with the maximum latency, $\text{Max}(s_1, s_2, ..., s_S)$. To maximize throughput, the system must pipeline stages efficiently and balance the load across them.
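
To make this relation concrete, the short sketch below estimates steady-state throughput from a set of stage latencies: once the pipeline is full, one instance completes roughly every $\text{Max}(s_1, ..., s_S)$ time units, so the slowest stage caps throughput. The `pipeline_throughput` helper and the latency values are illustrative assumptions, not measured StarAI figures.

```python
# Small sketch of pipeline throughput; stage latencies are illustrative
# placeholders, not profiled values.

def pipeline_throughput(stage_latencies, num_instances):
    """Estimate instances completed per unit time for a simple pipeline.

    In steady state the pipeline is bound by its slowest stage:
    a new instance finishes every Max(s_1, ..., s_S) time units.
    """
    bottleneck = max(stage_latencies)        # Max(s_1, ..., s_S)
    fill_time = sum(stage_latencies)         # time for the first instance
    total_time = fill_time + (num_instances - 1) * bottleneck
    return num_instances / total_time


if __name__ == "__main__":
    stages = [10e-3, 12e-3, 9e-3, 11e-3]     # S = 4 stage latencies (seconds)
    print(f"{pipeline_throughput(stages, num_instances=64):.1f} instances/s")
```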

Tensor Parallelism

Tensor parallelism partitions the input tensors across GPUs. Given a tensor $T$ of size $D \times N$ to be split across $G$ GPUs, each GPU receives a sub-tensor $T_g$ of size $\frac{D}{G} \times N$. The key is to choose an optimal splitting ratio $R = \frac{D}{G}$ that minimizes communication overhead while maximizing computational efficiency. Assuming $T$ represents input data for an LLM, the sub-tensor $T_g$ can be computed as:

$$T_g = T_{((g-1) \times R + 1) : (g \times R),\, :}$$

where $R$ must be chosen such that the parallel computation of $T_g$ across GPUs minimizes the overall execution time $E$, which includes both computation and communication costs:

$$E = \sum_{g=1}^{G} e_g + c(R, G)$$

Here, $e_g$ is the computation time for sub-tensor $T_g$ on GPU $g$, and $c(R, G)$ is the communication overhead, which depends on the split ratio $R$ and the number of GPUs $G$.
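
The row-wise split of $T$ into sub-tensors $T_g$ can be sketched as follows, using NumPy arrays as stand-ins for on-GPU tensors. The `split_tensor` helper and the assumption that $D$ divides evenly by $G$ are simplifications for illustration; a production system would also weigh the communication term $c(R, G)$ when choosing $R$.

```python
# Minimal sketch of the row-wise tensor split T_g described above.
import numpy as np

def split_tensor(t, num_gpus):
    """Split a D x N tensor into G sub-tensors of shape (D/G) x N."""
    d, _ = t.shape
    assert d % num_gpus == 0, "D must be divisible by G for an even split"
    r = d // num_gpus                        # split ratio R = D / G
    # 0-indexed equivalent of T_g = T_{((g-1)*R + 1) : (g*R), :}
    return [t[g * r:(g + 1) * r, :] for g in range(num_gpus)]


if __name__ == "__main__":
    T = np.arange(8 * 4).reshape(8, 4)       # D = 8, N = 4
    shards = split_tensor(T, num_gpus=4)
    for g, tg in enumerate(shards):
        print(f"GPU {g}: shape {tg.shape}")  # each shard is 2 x 4
```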

These parallelization techniques, when combined with advanced memory management strategies, can significantly enhance the inference capabilities of LLMs on mid-range GPUs. The mathematical formulations provided offer a glimpse into the complexity of optimizing these systems, taking into account both computational and communication costs to achieve the best performance.
