StarAI Technical Overview and Pros

1. AI DePIN

1.1 Decentralized Hybrid Computing Power Scheduling Algorithm

- Supported Device Types

StarAI DePIN is compatible with Linux, Windows, and macOS devices without requiring additional adaptations. It supports both NVIDIA and AMD GPUs, with computing power varying based on GPU, CPU, and memory capacity.

iPhone and Android devices can also connect to the network, but because of their limited computing power they can only run small-scale models.

- Computing Power Scheduling

Computing Power Access: Different devices are assigned AI models based on their computing capabilities:

  • Professional-grade GPUs (e.g., A100) can run large-scale models (e.g., LLaMA3 70B) with higher inference accuracy.

  • Consumer-grade GPUs (e.g., RTX 3060) are suitable for small to mid-sized models (e.g., Phi3 3.8B).

  • CPU-only devices can only support smaller models.

  • StarAI DePIN evaluates device computing power and deploys AI models accordingly, ensuring efficient resource utilization without disrupting daily tasks.

Computing Power Matching: The platform connects global devices and dynamically assigns tasks based on multiple factors:

  • Model Matching: Tasks are assigned to devices best suited for their specific inference requirements.

  • Load Matching: Tasks are prioritized for devices with lower workloads, maximizing efficiency.

  • Geographic Matching: Tasks are assigned to nearby devices first, minimizing network latency.
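The three matching criteria above can be combined into a single selection step. The following is a minimal sketch, not StarAI's actual scheduler: the `Device` fields, the `pick_device` function, the planar coordinates, and the weights `w_load` and `w_dist` are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from math import hypot

@dataclass
class Device:
    name: str
    vram_gb: float   # memory available for model weights (assumed metric)
    load: float      # current utilisation in [0, 1]
    x: float         # coarse location, hypothetical planar coordinates
    y: float

def pick_device(devices, required_vram_gb, user_xy, w_load=1.0, w_dist=0.01):
    """Return the eligible device with the lowest load/distance score, or None."""
    # Model matching: keep only devices that can hold the requested model.
    eligible = [d for d in devices if d.vram_gb >= required_vram_gb]
    if not eligible:
        return None

    def score(d):
        # Load matching and geographic matching, blended with assumed weights.
        distance = hypot(d.x - user_xy[0], d.y - user_xy[1])
        return w_load * d.load + w_dist * distance

    return min(eligible, key=score)

devices = [
    Device("a100-far", 80, 0.2, 100.0, 0.0),
    Device("rtx3060-near", 12, 0.5, 1.0, 0.0),
    Device("cpu-only", 0, 0.0, 0.0, 0.0),
]
small = pick_device(devices, required_vram_gb=8, user_xy=(0.0, 0.0))
large = pick_device(devices, required_vram_gb=40, user_xy=(0.0, 0.0))
print(small.name, large.name)
```

In this sketch the small model goes to the nearby consumer GPU even though it is busier, while the large model can only go to the distant professional GPU. How load and distance should actually be weighted is a design choice the overview does not specify.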

- Device Failure Handling

  • The platform conducts regular health checks to ensure tasks are assigned only to functional devices.

  • If a device fails multiple times, its task priority is lowered. Once the device recovers, it can be reassigned tasks.

  • If a task fails due to device malfunction, the system will automatically transfer it to a healthy device for execution.

- Computing Power Deactivation

  • Computing power providers can voluntarily opt out at any time, terminating task execution.

  • The platform logs device failure records, and devices with frequent malfunctions may be blacklisted and forcibly deactivated.
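The failure handling and deactivation rules above could be sketched as follows. This is an illustration under stated assumptions: the `DeviceHealth` class, the `run_with_failover` helper, the `blacklist_after` threshold, and the use of `RuntimeError` to signal a device malfunction are hypothetical and not taken from the platform.

```python
class DeviceHealth:
    """Tracks failures per device; frequent offenders are blacklisted."""

    def __init__(self, blacklist_after=5):
        self.failures = {}          # device id -> consecutive failure count
        self.blacklist = set()
        self.blacklist_after = blacklist_after

    def record_failure(self, device):
        self.failures[device] = self.failures.get(device, 0) + 1
        if self.failures[device] >= self.blacklist_after:
            self.blacklist.add(device)

    def record_success(self, device):
        # A recovered device regains full priority (assumed policy).
        self.failures[device] = 0

    def candidates(self, devices):
        # Skip blacklisted devices; fewer failures means higher priority.
        usable = [d for d in devices if d not in self.blacklist]
        return sorted(usable, key=lambda d: self.failures.get(d, 0))

def run_with_failover(task, devices, health, execute):
    """Try devices in priority order, moving the task on after each failure."""
    for device in health.candidates(devices):
        try:
            result = execute(device, task)
        except RuntimeError:
            health.record_failure(device)
            continue
        health.record_success(device)
        return device, result
    raise RuntimeError("no healthy device available")

def flaky_execute(device, task):
    # Stand-in executor: device "d1" always malfunctions.
    if device == "d1":
        raise RuntimeError("device malfunction")
    return f"{task} done on {device}"

health = DeviceHealth(blacklist_after=2)
used, result = run_with_failover("task-1", ["d1", "d2"], health, flaky_execute)
print(used, result)
```

After one failure `d1` is only deprioritized; a second failure would reach the assumed threshold and remove it from the candidate list entirely.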


2. AI Agent Layer

Our multimodal virtual agent has its own personality traits and handles multimodal inputs and outputs. Each agent can be given a unique personality and backstory, and its short-term and long-term memory lets it respond differently to different users. It accepts and produces text, voice, and images.

Hierarchical Architecture

The agent's architecture consists of four layers:

  1. Input/Output Handlers: Process and convert data

  2. Agent Framework and Centers: Manage personality, tools, and memory

  3. Storage System: Interface with storage services

  4. Model Foundation: Provide model capabilities

The Input Handler converts raw user input (text, voice, or image) into a uniform text format. The Output Handler converts agent responses into the appropriate format (text, voice, or image) using technologies such as Text-to-Speech and Stable Diffusion.


3. Core Components

  • Prompt Center: Manages prompts for different agent personalities

  • Agent Framework:

    • ReAct Agent: Handles logical thinking and tool usage

    • Memory Agent: Maintains long-term memory

  • Tools Center: Maintains RAG tool, Text To Image tool, Python tool, etc.

  • Memory Center: Manages short-term (conversation history) and long-term memories


4. Conversation Processing Workflow

The process flow begins with user input (text, speech, or image) through the Input Router. Speech inputs are converted to text via Speech-to-Text, images via Image-to-Text, while text input is processed directly. The resulting text is sent to the ReAct Agent, which accesses short-term and long-term memories and may use various tools (RAG, Text-to-Image, etc.) to build a response. The output is saved to short-term memory and routed through the Output Router, which converts it to the appropriate format (text, speech, or image) before delivery. Simultaneously, the Memory Agent can extract and summarize information from short-term memory to update long-term memory.
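The workflow above can be traced in a few lines of code. This is a rough sketch with stubbed components: every function name, the `Memory` class, and the `keep_last` threshold are hypothetical, the converters return placeholder strings instead of calling real Speech-to-Text or Image-to-Text models, and storing the user's turn alongside the agent's reply is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    short_term: list = field(default_factory=list)  # conversation history
    long_term: list = field(default_factory=list)   # summarized information

def route_input(kind, payload):
    """Input Router: normalize every modality to text (converters are stubs)."""
    if kind == "text":
        return payload
    if kind == "speech":
        return f"[transcript of {payload}]"   # stand-in for Speech-to-Text
    if kind == "image":
        return f"[caption of {payload}]"      # stand-in for Image-to-Text
    raise ValueError(f"unsupported input kind: {kind}")

def react_agent(text, memory):
    """Stand-in for the ReAct Agent; a real one would call an LLM and tools."""
    context = len(memory.short_term) + len(memory.long_term)
    return f"reply to '{text}' using {context} memory items"

def route_output(kind, text):
    """Output Router: convert the reply to the requested modality (stub)."""
    return text if kind == "text" else f"[{kind} rendering of: {text}]"

def memory_agent(memory, keep_last=4):
    """Fold older short-term entries into one long-term summary entry."""
    if len(memory.short_term) > keep_last:
        old = memory.short_term[:-keep_last]
        memory.long_term.append(f"summary of {len(old)} earlier entries")
        memory.short_term = memory.short_term[-keep_last:]

def handle_turn(kind_in, payload, kind_out, memory):
    text = route_input(kind_in, payload)
    reply = react_agent(text, memory)
    memory.short_term.extend([("user", text), ("agent", reply)])
    memory_agent(memory)
    return route_output(kind_out, reply)

memory = Memory()
first = handle_turn("speech", "clip.wav", "text", memory)
handle_turn("text", "hello again", "text", memory)
handle_turn("image", "photo.png", "speech", memory)
print(first, memory.long_term)
```

Here the Memory Agent runs inline after each turn for simplicity; the text describes it as running alongside the main flow, which a real system might do asynchronously.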


5. Core Technologies

  • ReAct Thought Chain: ReAct is a prompting technique that integrates reasoning, action planning, and knowledge source assimilation. It consists of:

    • Main Instruction: Initiates model understanding

    • ReAct Steps: Outlines reasoning and action planning

    • Reasoning: Uses chain-of-thought approach

    • Actions: Provides action choices after reasoning

  • Retrieval-Augmented Generation (RAG): RAG combines information retrieval with generative models to produce accurate, contextually relevant outputs. It enhances agent configuration by accessing comprehensive knowledge bases during user interactions, improving decision-making and recommendations.

  • T2V/I2V Models

    The model is based on a hybrid architecture combining a Variational Autoencoder (VAE) and a Diffusion Transformer (DiT), which enhances its temporal modeling and scene understanding. It supports text-to-video and image-to-video generation, as well as video editing and video-to-audio tasks. The model can also generate text in both Chinese and English, which makes it more practical, and it can produce cinematic-quality text and animations.

  • Planning Framework with Task Decomposition: This framework enables systematic problem-solving by breaking down complex tasks into manageable components. It improves efficiency in agent creation by dividing the process into discrete tasks like concept development, voice selection, and personality definition.

  • Multi-Agent Collaboration: StarAI implements specialized agents working together:

    • Dialog Agent: Manages user interactions

    • Configuration Agent: Handles technical setup

    • Quality Assurance Agent: Monitors and optimizes the process
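Returning to the ReAct thought chain listed above, the reason/act/observe cycle can be illustrated with a toy loop. This is only a sketch: `scripted_model` is a hard-coded stand-in for an LLM, the `Action: tool[input]` text format and the `calculator` tool are assumptions chosen for the example, and no error handling is included for malformed model output.

```python
def calculator(expression):
    """Toy tool: evaluates 'a + b' or 'a * b' for integers."""
    a, op, b = expression.split()
    return str(int(a) + int(b)) if op == "+" else str(int(a) * int(b))

TOOLS = {"calculator": calculator}

def scripted_model(transcript):
    """Stand-in for an LLM: reasons, acts once, then answers."""
    if "Observation:" not in transcript:
        return "Thought: I need to compute this.\nAction: calculator[6 * 7]"
    observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return f"Thought: I have the result.\nFinal Answer: {observation}"

def react_loop(question, model, tools, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)            # reasoning plus a chosen action
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        # Parse "Action: tool[input]", run the tool, append the observation.
        action = step.split("Action:", 1)[1].strip()
        name, arg = action.split("[", 1)
        observation = tools[name.strip()](arg.rstrip("]"))
        transcript += f"Observation: {observation}\n"
    return None                             # gave up after max_steps

answer = react_loop("What is 6 times 7?", scripted_model, TOOLS)
print(answer)
```

The transcript plays the role of the prompt: the main instruction and question come first, and each pass appends a reasoning step, an action, and the resulting observation.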


6. Large Language Model Training and Inference

6.1 Parallelization Techniques

  • Model Parallelism: Model parallelism involves distributing the layers of an LLM across multiple GPUs. Consider an LLM with $L$ layers to be distributed over $G$ GPUs. Each GPU is assigned a subset of layers, with the number of layers per GPU being $\frac{L}{G}$. The challenge is to minimize the communication overhead while maintaining computational balance.

  • Let $C_i$ represent the computational complexity of layer $i$ and $M_i$ its memory requirement. The goal is to find an allocation $A = \{a_1, a_2, \ldots, a_G\}$, where $a_g$ is the set of layers assigned to GPU $g$, such that the total communication overhead $O_{comm}$ is minimized and the memory requirements $M_{req}$ are balanced:

$$A^* = \arg\min_{A} O_{comm}(A)$$

$$\text{s.t. } \sum_{i \in a_g} M_i \leq M_{max} \text{ and } \sum_{i \in a_g} C_i \approx \frac{1}{G} \sum_{j=1}^{L} C_j$$

Here, $M_{max}$ is the maximum memory available per GPU, and the second constraint ensures that the computational load is evenly distributed.
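One simple way to approximate such an allocation is a greedy pass that keeps layers contiguous, so that only neighboring GPUs need to exchange activations. The sketch below is a heuristic under that assumption, not a solver for the exact minimization; the function name `partition_layers` and its parameters are hypothetical.

```python
def partition_layers(compute, memory, num_gpus, mem_max):
    """Greedily split layers into contiguous groups, at most one per GPU."""
    target = sum(compute) / num_gpus        # ideal compute share per GPU
    groups, current = [], []
    c_sum = m_sum = 0.0
    for i, (c, m) in enumerate(zip(compute, memory)):
        if m > mem_max:
            raise ValueError(f"layer {i} alone exceeds mem_max")
        over_mem = m_sum + m > mem_max
        over_compute = c_sum + c > target
        if current and (over_mem or over_compute):
            if len(groups) + 1 < num_gpus:
                # Close the current group and move on to the next GPU.
                groups.append(current)
                current, c_sum, m_sum = [], 0.0, 0.0
            elif over_mem:
                raise ValueError("layers do not fit in the available GPU memory")
            # Otherwise this is the last GPU: it absorbs the remaining layers.
        current.append(i)
        c_sum += c
        m_sum += m
    groups.append(current)
    return groups

# Six layers, the last two twice as expensive, split across two GPUs.
groups = partition_layers(
    compute=[1, 1, 1, 1, 2, 2], memory=[4] * 6, num_gpus=2, mem_max=16
)
print(groups)
```

In this example both GPUs end up with a compute load of 4, and the first GPU uses exactly its 16 units of memory. A greedy pass can leave the last GPU overloaded on less regular inputs, which is where a proper optimization over $A$ would do better.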

6.2 Pipeline Parallelism

Pipeline parallelism processes multiple instances of the model concurrently. If $P$ instances are processed in parallel, with each instance going through $S$ stages, the throughput $T$ can be increased:

$$T = \frac{P \times S}{\text{Total time per instance}}$$

The total time per instance is affected by the stage with the maximum latency, $\max(s_1, s_2, \ldots, s_S)$. To maximize throughput, the system must pipeline stages efficiently and balance the load across stages.
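The effect of the slowest stage can be checked with a small simulation. This sketch assumes identical instances and stages that each handle one instance at a time; it also reads the denominator of the throughput formula as the pipeline's total running time, which is an interpretation rather than something the text states. The function names are hypothetical.

```python
def pipeline_makespan(stage_times, num_instances):
    """Time for num_instances identical inputs to clear every stage."""
    finish = [0.0] * len(stage_times)   # when each stage last became free
    for _ in range(num_instances):
        ready = 0.0                     # when this instance leaves the previous stage
        for s, duration in enumerate(stage_times):
            start = max(ready, finish[s])   # wait for the input and for the stage
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]

def throughput(stage_times, num_instances):
    """Stage executions completed per unit time: P * S / total time."""
    total = pipeline_makespan(stage_times, num_instances)
    return num_instances * len(stage_times) / total

unbalanced = pipeline_makespan([1, 4, 1], num_instances=4)
balanced = pipeline_makespan([2, 2, 2], num_instances=4)
print(unbalanced, balanced)
```

Both pipelines perform the same total work per instance (6 time units), yet the unbalanced one takes 18 units for four instances against 12 for the balanced one, because every instance queues behind the 4-unit stage.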

6.3 Tensor Parallelism

Tensor parallelism partitions the input tensors across GPUs. Given a tensor $T$ of size $D \times N$ to be split across $G$ GPUs, each GPU receives a sub-tensor $T_g$ of size $\frac{D}{G} \times N$.

The key is to choose an optimal splitting ratio $R = \frac{D}{G}$ that minimizes the communication overhead while maximizing computational efficiency. Assuming $T$ is a tensor representing input data for an LLM, the split tensor $T_g$ can be computed as:

$$T_g = T_{((g-1) \times R + 1) : (g \times R),\, :}$$

Where $R$ must be chosen such that the parallel computation of $T_g$ across GPUs minimizes the overall execution time $E$, which includes both computation and communication costs:

$$E = \sum_{g=1}^{G} e_g + c(R, G)$$
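The row split can be written directly as slicing. This is a minimal sketch using nested Python lists in place of a real tensor library; the function name `split_rows` is hypothetical, and it assumes $D$ is divisible by $G$. The formula above is 1-based and inclusive, while the code uses 0-based, end-exclusive slices, so GPU $g$ in the formula corresponds to index `g - 1` here.

```python
def split_rows(tensor, num_gpus):
    """Split a D x N tensor (list of rows) into num_gpus shards of R = D / G rows."""
    rows = len(tensor)
    if rows % num_gpus != 0:
        raise ValueError("D must be divisible by G in this sketch")
    r = rows // num_gpus
    # Shard g holds rows g*r .. (g+1)*r - 1 (0-based).
    return [tensor[g * r:(g + 1) * r] for g in range(num_gpus)]

tensor = [[1, 2], [3, 4], [5, 6], [7, 8]]   # D = 4, N = 2
shards = split_rows(tensor, num_gpus=2)
print(shards)
```

Concatenating the shards in order reproduces the original tensor, which is the property the gather step after parallel computation relies on.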


7. Memory Management

7.1 Dynamic Memory Allocation:

Let $L$ be the length of the input sequence, $M(L)$ the memory required for a sequence of length $L$, and $B$ the maximum memory block size. The memory allocation $A(L)$ for a sequence of length $L$ is given by:

$$A(L) = \min(M(L), B)$$

This keeps the allocation tied to the sequence length and capped at $B$, preventing unnecessary memory usage.

7.2 Paged Memory Management:

Paged memory management divides memory into fixed-size pages and maintains a page table that maps logical pages to physical pages. Its efficiency is measured by how well it minimizes page faults and maximizes page reuse.
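A page table of this kind can be sketched in a few lines. The example below is illustrative only: the `PagedAllocator` class, its method names, and the choice to measure sequences in tokens are assumptions, and it models bookkeeping rather than real GPU memory.

```python
class PagedAllocator:
    """Maps each sequence's logical pages to physical pages on demand."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size            # tokens per page (assumed unit)
        self.free = list(range(num_pages))    # free physical page ids
        self.tables = {}                      # sequence id -> physical page ids

    def ensure(self, seq_id, num_tokens):
        """Grow the page table of seq_id so it can hold num_tokens tokens."""
        table = self.tables.setdefault(seq_id, [])
        needed = -(-num_tokens // self.page_size)   # ceiling division
        extra = needed - len(table)
        if extra > len(self.free):
            raise MemoryError("out of pages")
        for _ in range(max(extra, 0)):
            table.append(self.free.pop())
        return table

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free list for reuse."""
        self.free.extend(self.tables.pop(seq_id, []))

allocator = PagedAllocator(num_pages=4, page_size=16)
table = allocator.ensure("seq-a", num_tokens=17)   # 17 tokens need 2 pages
print(table, len(allocator.free))
```

Because pages are allocated only as a sequence grows, memory follows the actual sequence length, and released pages are immediately reusable by other sequences.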

7.3 Copy-on-Write Mechanism:

For a memory block $B$ shared by $n$ sequences, the memory saving $S_{COW}$ is:

$$S_{COW} = n \times \text{Size}(B) \times \left(1 - \frac{\text{Modified Portion}}{\text{Size}(B)}\right)$$

This defers copying until modification occurs.
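Deferred copying is commonly implemented with reference counts. The following sketch shows the idea under that assumption; the `Block` and `Sequence` classes and the `fork` and `write` helpers are hypothetical names, not part of the described system.

```python
class Block:
    """A memory block with a count of the sequences that share it."""

    def __init__(self, data):
        self.data = list(data)
        self.refs = 1

class Sequence:
    def __init__(self, block):
        self.block = block

def fork(seq):
    """Create a new sequence that shares the parent's block without copying."""
    seq.block.refs += 1
    return Sequence(seq.block)

def write(seq, index, value):
    """Copy the block only if it is shared, then apply the modification."""
    if seq.block.refs > 1:
        seq.block.refs -= 1
        seq.block = Block(seq.block.data)   # private copy made at write time
    seq.block.data[index] = value

parent = Sequence(Block([1, 2, 3]))
child = fork(parent)
shared_before = child.block is parent.block
write(child, 0, 99)
print(parent.block.data, child.block.data)
```

Until the write, both sequences occupy a single block; only the modifying sequence pays for a copy, and the parent's data is left untouched.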

7.4 Swapping and Recomputation:

  • Swapping moves less-accessed data to auxiliary memory, with swap operation:

$$S_{swap} = \text{Size}(B) \times \text{Swap Rate}$$

  • Recomputation recalculates evicted data with overhead:

$$S_{recompute} = \text{Computational Cost} \times \text{Recompute Rate}$$

The choice between swapping and recomputation considers their relative costs and current memory state.
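A cost comparison along these lines might look like the sketch below. It applies the two formulas above directly and picks the cheaper option; treating the two costs as directly comparable numbers, and ignoring the current memory state, are simplifying assumptions, and the function names are hypothetical.

```python
def swap_cost(block_size, swap_rate):
    """S_swap = Size(B) x Swap Rate."""
    return block_size * swap_rate

def recompute_cost(computational_cost, recompute_rate):
    """S_recompute = Computational Cost x Recompute Rate."""
    return computational_cost * recompute_rate

def choose_strategy(block_size, swap_rate, computational_cost, recompute_rate):
    """Pick whichever option has the lower estimated cost (ties go to swap)."""
    s_swap = swap_cost(block_size, swap_rate)
    s_recompute = recompute_cost(computational_cost, recompute_rate)
    return "swap" if s_swap <= s_recompute else "recompute"

large_block = choose_strategy(1024, 0.5, 300, 1.0)   # 512 vs 300
small_block = choose_strategy(128, 0.5, 300, 1.0)    # 64 vs 300
print(large_block, small_block)
```

With these illustrative numbers, a large block is cheaper to recompute than to move, while a small block is cheaper to swap.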

  • The total execution time $E_{total}$ includes both computation $C_{comp}$ and communication $C_{comm}$ times:

$$E_{total} = C_{comp} + C_{comm}$$

These sophisticated strategies enable efficient LLM inference on mid-range GPUs, allowing larger models and increased throughput with limited memory resources. The system's efficiency $E$ can be calculated as:

$$E = \frac{S_{actual}}{G}$$

Where $S_{actual}$ is the actual speedup achieved with $G$ GPUs, reflecting the real-world performance of these optimization techniques.
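As a quick worked example, the efficiency can be computed from measured run times. The sketch assumes the usual definition of speedup as single-GPU time divided by multi-GPU time, which the text does not spell out; the function name and the timings are made up for illustration.

```python
def parallel_efficiency(time_single, time_parallel, num_gpus):
    """E = S_actual / G, with S_actual = single-GPU time / G-GPU time."""
    speedup = time_single / time_parallel
    return speedup / num_gpus

# Hypothetical timings: 120 s on one GPU, 40 s on four GPUs.
efficiency = parallel_efficiency(120.0, 40.0, num_gpus=4)
print(efficiency)
```

A speedup of 3 on 4 GPUs gives an efficiency of 0.75; the missing quarter corresponds to communication and other overhead that does not shrink as GPUs are added.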