Memory Management Strategies

Effective memory management is a cornerstone of efficient LLM inference on mid-range GPUs. The strategies outlined below are inspired by principles from operating systems and are tailored to address the unique challenges posed by LLMs.

Dynamic Memory Allocation

Dynamic memory allocation is essential for handling variable-length input sequences common in LLM inference. Instead of allocating a fixed, maximum-sized block of memory for each sequence, we allocate memory based on the actual sequence length. This approach significantly reduces memory waste due to over-provisioning.

Let $L$ be the length of the input sequence, $M(L)$ the memory required for a sequence of length $L$, and $B$ the maximum memory block size. The memory allocation $A(L)$ for a sequence of length $L$ is given by:

$$A(L) = \min(M(L), B)$$

This ensures that memory allocation is proportional to the sequence length, preventing unnecessary memory usage.
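
As an illustration of this rule, the sketch below allocates KV-cache memory in proportion to the actual sequence length and caps it at a maximum block size. The per-token byte count and the cap are assumed values for a hypothetical fp16 model, not figures from this whitepaper.

```python
# Length-proportional allocation: A(L) = min(M(L), B).
# Constants are illustrative assumptions (fp16 KV cache, 32 layers, hidden size 4096).
BYTES_PER_TOKEN = 2 * 2 * 32 * 4096   # K and V, 2 bytes each, per layer, per hidden unit
MAX_BLOCK_BYTES = 2 * 1024**3         # B: cap on a single allocation (assumed 2 GiB)

def memory_required(seq_len: int) -> int:
    """M(L): memory needed for the KV cache of a sequence of length L."""
    return seq_len * BYTES_PER_TOKEN

def allocate(seq_len: int) -> int:
    """A(L) = min(M(L), B): reserve only what the sequence needs, capped at B."""
    return min(memory_required(seq_len), MAX_BLOCK_BYTES)

# A 128-token prompt reserves 128 * BYTES_PER_TOKEN bytes (~64 MiB here),
# instead of a fixed maximum-length block.
print(allocate(128))
```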

Paged Memory Management

Paged memory management, analogous to virtual memory in operating systems, involves dividing the memory into fixed-size pages. This approach allows for efficient memory utilization and the ability to share memory between different inference tasks.

For a KV cache requiring $P$ pages, each page of size $S$, the memory manager maintains a page table that maps logical pages to physical pages. The memory manager's efficiency is characterized by its ability to minimize page faults and maximize page reuse.
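
To make the page-table idea concrete, here is a small sketch in Python. The class name `PagedKVCache`, the page size of 16 tokens, and the method names are assumptions for illustration rather than an API defined by StarAI: each sequence gets a list of physical page IDs, and a token position is translated to a (physical page, offset) pair much like a virtual-address lookup.

```python
# Illustrative paged KV-cache manager with a per-sequence page table.
PAGE_SIZE = 16  # S: tokens per page (assumed)

class PagedKVCache:
    def __init__(self, num_physical_pages: int):
        self.free_pages = list(range(num_physical_pages))  # pool of physical pages
        self.page_tables = {}  # sequence id -> list of physical page ids

    def num_pages(self, seq_len: int) -> int:
        """P: logical pages needed for a sequence of seq_len tokens."""
        return -(-seq_len // PAGE_SIZE)  # ceiling division

    def allocate(self, seq_id: str, seq_len: int) -> list:
        needed = self.num_pages(seq_len)
        if needed > len(self.free_pages):
            raise MemoryError("no free physical pages; evict (swap or recompute) first")
        self.page_tables[seq_id] = [self.free_pages.pop() for _ in range(needed)]
        return self.page_tables[seq_id]

    def translate(self, seq_id: str, token_idx: int):
        """Map a token position to (physical page, offset), like a page-table lookup."""
        table = self.page_tables[seq_id]
        return table[token_idx // PAGE_SIZE], token_idx % PAGE_SIZE

cache = PagedKVCache(num_physical_pages=1024)
cache.allocate("seq-0", seq_len=40)            # 40 tokens -> 3 logical pages
print(cache.translate("seq-0", token_idx=37))  # (physical page id, offset 5)
```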

Copy-on-Write Mechanism

The copy-on-write (COW) mechanism is a memory optimization technique that comes into play during the inference process when multiple sequences share common prefixes. Instead of duplicating the entire memory block when a write operation is required, COW defers the copy until the actual modification occurs.

Given a memory block $B$ shared by $n$ sequences, the COW mechanism ensures that only the modified portion of $B$ is copied. The memory saving $S_{COW}$ can be expressed as:

$$S_{COW} = n \times \text{Size}(B) \times \left(1 - \frac{\text{Modified Portion}}{\text{Size}(B)}\right)$$

This formula captures the memory saving achieved by deferring the copy operation until it is necessary.
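
A minimal sketch of the mechanism, using reference counting on a shared prefix block, is shown below. The class and function names (`COWBlock`, `fork`, `write`) are hypothetical and only illustrate how the copy is deferred until a write actually happens.

```python
# Copy-on-write on a shared KV-cache block (illustrative sketch).
class COWBlock:
    """A block of cached data that may be shared by several sequences."""
    def __init__(self, data: bytearray):
        self.data = data
        self.refs = 1

def fork(block: COWBlock) -> COWBlock:
    """Share the block instead of copying it; only the reference count grows."""
    block.refs += 1
    return block

def write(block: COWBlock, offset: int, value: int) -> COWBlock:
    """Copy only when a shared block is actually modified (copy-on-write)."""
    if block.refs > 1:
        block.refs -= 1
        block = COWBlock(bytearray(block.data))  # private copy for the writer
    block.data[offset] = value
    return block

prefix = COWBlock(bytearray(b"shared prompt prefix"))
seq_a = fork(prefix)                 # both sequences reference the same memory
seq_b = fork(prefix)
seq_b = write(seq_b, 0, ord("S"))    # only now does seq_b get its own copy
print(prefix.data is seq_a.data, prefix.data is seq_b.data)  # True False
```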

Swapping and Recomputation

Swapping and recomputation are two strategies to handle memory eviction when the GPU memory is fully utilized.

- Swapping involves moving less frequently accessed data to a slower, auxiliary memory (such as system RAM or an SSD). When the data is needed again, it is swapped back into GPU memory. The swap cost $S_{swap}$ is modeled as:

$$S_{swap} = \text{Size}(B) \times \text{Swap Rate}$$

- Recomputation is an alternative to swapping that involves recalculating the evicted data when it is required. This is particularly useful for data that can be recomputed from other available data without loss of information. The recomputation overhead $S_{recompute}$ is given by:

$$S_{recompute} = \text{Computational Cost} \times \text{Recompute Rate}$$

The decision to swap or recompute is based on the relative costs of the two operations and the current memory state.

By integrating these memory management strategies, we can significantly enhance the inference capabilities of LLMs on mid-range GPUs, allowing them to handle larger models and increased throughput with limited memory resources.
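
To make the swap-versus-recompute trade-off concrete, here is a minimal Python sketch that compares the two cost models above and picks the cheaper eviction strategy. The rate constants and the example block sizes are assumed placeholder values for illustration, not figures from this whitepaper.

```python
# Sketch of the swap-vs-recompute decision. Cost models follow the text:
#   S_swap      = Size(B) * Swap Rate
#   S_recompute = Computational Cost * Recompute Rate
# All constants below are illustrative assumptions, not measured values.
SWAP_RATE = 1.0 / (8 * 1024**3)   # assumed: seconds per byte over a PCIe-like link
RECOMPUTE_RATE = 2.0e-12          # assumed: seconds per FLOP of re-executed work

def swap_cost(block_size_bytes: int) -> float:
    return block_size_bytes * SWAP_RATE

def recompute_cost(flops_to_recreate: float) -> float:
    return flops_to_recreate * RECOMPUTE_RATE

def evict(block_size_bytes: int, flops_to_recreate: float) -> str:
    """Pick the cheaper eviction strategy for a KV-cache block."""
    if swap_cost(block_size_bytes) <= recompute_cost(flops_to_recreate):
        return "swap"
    return "recompute"

# A small block that is expensive to rebuild is swapped out;
# a large block that is cheap to rebuild is recomputed on demand.
print(evict(block_size_bytes=64 * 1024**2, flops_to_recreate=1e13))  # swap
print(evict(block_size_bytes=4 * 1024**3, flops_to_recreate=1e11))   # recompute
```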