Architecture of Multimodal Virtual Intelligent Agents
The goal of our multimodal virtual intelligent agent is to act as an agent with its own personality traits while also handling multimodal inputs and outputs. On the agent side, users can endow it with a unique personality and background story, giving it rich and distinctive personality traits; with the support of short-term and long-term memory, it can also respond differently to different users. On the multimodal side, the agent can accept and produce text, voice, and images. This rich multimodal input and output capability enables it to perform exceptionally well when interacting with users.
With that in mind, let's walk through the architecture of our multimodal virtual intelligent agent.
The above diagram illustrates the hierarchical architecture of the multimodal virtual intelligent agent. We have divided the entire agent's architecture into four layers.
The first layer includes the agent's Input Handler and Output Handler, which provide data processing and conversion capabilities for the agent's inputs and outputs.
The second layer comprises the Agent Framework and three centers, which respectively endow the agent with personality traits, provide tools for interacting with the external world, and establish a memory system.
The third layer serves as the agent's storage system, interfacing with various storage services. The fourth layer is the agent's model foundation, granting the agent the capabilities of various models.
The Input Handler is responsible for processing the raw information provided by the user. At this stage, the user's input may be one of three different types: text, voice, or image.
Text: The Input Handler does not need to perform any processing on the text.
Voice: The Input Handler uses Speech-to-Text (STT) to process the voice input, converting it into text.
Image: The Input Handler uses a Vision-Language Model (VLM) to process the image, generating a description of the image. If the user's input is mixed data, such as image + text or image + voice, the input data is first unified into image + text. Then, the image and text are processed simultaneously using the VLM, generating a text output.
After processing by the Input Handler, the user's input is uniformly converted into text input.
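A minimal sketch of this routing logic is shown below. The speech_to_text and describe_image helpers are hypothetical wrappers around an STT service and a VLM; the actual Input Handler implementation may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserInput:
    """One user turn; any combination of the three modalities may be present."""
    text: Optional[str] = None
    audio: Optional[bytes] = None
    image: Optional[bytes] = None


def speech_to_text(audio: bytes) -> str:
    """Hypothetical wrapper around an STT service."""
    raise NotImplementedError


def describe_image(image: bytes, prompt: str = "Describe this image.") -> str:
    """Hypothetical wrapper around a VLM that returns a text description."""
    raise NotImplementedError


def handle_input(user_input: UserInput) -> str:
    """Normalize any supported input combination into plain text."""
    if user_input.audio is not None:
        # Voice (possibly alongside an image): convert speech to text first,
        # so mixed inputs are unified into image + text.
        user_input.text = speech_to_text(user_input.audio)
    if user_input.image is not None:
        # Image, or image + text: let the VLM read both and produce text.
        prompt = user_input.text or "Describe this image."
        return describe_image(user_input.image, prompt)
    return user_input.text or ""
```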
The Output Handler is responsible for processing the raw information output by the Agent. The raw output information from the Agent is always in text form but includes tags indicating the target information type. There are three types of tags: text, voice, and image.
Text: Indicates that the output information from the Agent is text that needs to be directly output, and the Output Handler does not need to perform any processing.
Voice: Indicates that the output information from the Agent needs to be converted into voice. The Output Handler will use Text-to-Speech (TTS) to convert the text information into voice, and then output the voice.
Image: Indicates that the output information from the Agent is a description for generating an image. The Output Handler will use this text-formatted image description and generate the corresponding image through Stable Diffusion, then output the image.
After processing by the Output Handler, the Agent's output is transformed into multimodal output.
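As a rough illustration, the tag-based dispatch might look like the sketch below. The "[text]" / "[voice]" / "[image]" tag format and the text_to_speech / generate_image helpers are assumptions for illustration, not the Output Handler's actual conventions.

```python
def text_to_speech(text: str) -> bytes:
    """Hypothetical TTS wrapper returning raw audio bytes."""
    raise NotImplementedError


def generate_image(description: str) -> bytes:
    """Hypothetical Stable Diffusion wrapper returning image bytes."""
    raise NotImplementedError


def handle_output(raw_output: str):
    """Dispatch the agent's tagged text output to the target modality."""
    if raw_output.startswith("[voice]"):
        # The text after the tag is spoken aloud via TTS.
        return text_to_speech(raw_output[len("[voice]"):].strip())
    if raw_output.startswith("[image]"):
        # The text after the tag is an image description for Stable Diffusion.
        return generate_image(raw_output[len("[image]"):].strip())
    # Text tag (or no tag): pass the text through unchanged.
    return raw_output.removeprefix("[text]").strip()
```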
The Prompt Center is responsible for managing all the prompts of the multimodal virtual intelligent agents. These prompts are crucial for endowing different agents with distinct personalities and background stories.
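As an illustration, a persona prompt could be assembled from a template along these lines; the template text, field names, and example values are hypothetical, not the Prompt Center's actual schema.

```python
PERSONA_TEMPLATE = """\
You are {name}, a virtual agent.
Personality: {personality}
Background: {backstory}
Always stay in character when responding to the user.
"""


def build_persona_prompt(name: str, personality: str, backstory: str) -> str:
    """Render the persona section injected into an agent's system prompt."""
    return PERSONA_TEMPLATE.format(
        name=name, personality=personality, backstory=backstory
    )


# Example usage with a made-up persona:
print(build_persona_prompt(
    name="Luna",
    personality="curious, playful, a little sarcastic",
    backstory="a retired starship navigator who now runs a tea shop",
))
```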
Currently, we have two types of agent frameworks, each applied in different scenarios:
ReAct Agent: The ReAct Agent has the capability to use tools. We use this framework as the main reasoning framework for multimodal virtual intelligent agents; it is responsible for responding to user input, thinking, and generating raw output. Because it can use tools, it can also actively perceive information from the external environment and modify that environment (a minimal sketch of such a loop follows this list).
Memory Agent: The Memory Agent is used to maintain the long-term memory of multimodal virtual intelligent agents. It is responsible for extracting entity information based on the agent's current short-term memory, summarizing and generalizing it, and generating or modifying the corresponding long-term memory.
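Below is a minimal, framework-agnostic sketch of a ReAct-style loop, intended only to illustrate the think/act/observe cycle; it is not the agent's actual implementation, and the "Action: tool[input]" text format is an assumption.

```python
import re
from typing import Callable, Dict


def react_loop(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> str:
    """Minimal ReAct-style loop: think, optionally act with a tool, observe, repeat."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model emits a Thought and either an Action or a final Answer.
        step = llm(transcript)
        transcript += step + "\n"
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if match is None:
            break  # no tool call means the model has produced its answer
        tool_name, tool_input = match.group(1), match.group(2)
        observation = tools[tool_name](tool_input)
        transcript += f"Observation: {observation}\n"
    return transcript
```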
The Tools Center is responsible for maintaining all the tools that agents can use and can generate different tool configurations based on different agents. The tools maintained by the Tools Center include the RAG tool, Text To Image tool, Python tool, and others.
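One possible shape for such a registry is sketched below; the register_tool decorator, the tool names, and the stub bodies are illustrative assumptions rather than the Tools Center's real API.

```python
from typing import Callable, Dict, Iterable

# Hypothetical shared registry of tools available to any agent.
TOOL_REGISTRY: Dict[str, Callable[[str], str]] = {}


def register_tool(name: str):
    """Decorator that adds a tool function to the shared registry."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        TOOL_REGISTRY[name] = func
        return func
    return decorator


@register_tool("rag")
def rag_tool(query: str) -> str:
    """Hypothetical retrieval-augmented generation lookup."""
    raise NotImplementedError


@register_tool("text_to_image")
def text_to_image_tool(description: str) -> str:
    """Hypothetical Text-to-Image call; returns a path or URL to the image."""
    raise NotImplementedError


def tools_for_agent(allowed: Iterable[str]) -> Dict[str, Callable[[str], str]]:
    """Build a per-agent tool configuration from the shared registry."""
    return {name: TOOL_REGISTRY[name] for name in allowed}
```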
The Memory Center is responsible for maintaining both the short-term and long-term memories of agents and can generate different memory configurations based on different agents. An agent's short-term memory primarily consists of conversation history, while long-term memory includes various forms, such as entity memory, historical conversation summaries, and more.
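A simplified sketch of what a per-agent memory configuration might look like is given below; the window size and field names are chosen purely for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple


@dataclass
class AgentMemory:
    """Per-agent memory: a short-term window plus long-term stores."""

    # Short-term memory: a bounded window of (speaker, utterance) pairs,
    # i.e. the recent conversation history.
    short_term: Deque[Tuple[str, str]] = field(
        default_factory=lambda: deque(maxlen=20)
    )
    # Long-term memory: entity facts and rolling conversation summaries.
    entities: Dict[str, str] = field(default_factory=dict)
    summaries: List[str] = field(default_factory=list)

    def remember_turn(self, speaker: str, utterance: str) -> None:
        """Append one conversation turn to short-term memory."""
        self.short_term.append((speaker, utterance))
```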
Here is the process flow for a multimodal virtual agent handling a round of user input:
The process starts with the user providing input, which can be in the form of text, speech, or an image.
The user input is first routed through the Input Router. If the input is speech, it is sent to the Speech to Text module and converted to text. If the input is an image, it is sent to the Image to Text module and also converted to text. If the input is already in text form, it is directly processed.
The text, whether provided directly by the user or converted from speech or an image, is sent to the ReAct Agent. The ReAct Agent interacts with the Memory module to load short-term and long-term memories, and it may invoke various tools (such as the RAG tool, Text-to-Image tool, etc.) as needed, receiving their results. This helps the agent build a chain of thought and produce the next text response.
After the text output is built, the response is saved into short-term memory. The text output is then sent to the Output Router and, when required, converted to speech or an image through the Text-to-Speech or Text-to-Image module.
The final output is delivered to the user as text, speech, or an image, as determined by the Output Router. Meanwhile, the Memory Agent, triggered by the ReAct Agent or by a timer, transfers information from short-term memory to long-term memory: it reads the short-term memory and, using methods such as entity extraction and summarization, extracts information into long-term memory.
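Putting the pieces together, a single round could be orchestrated roughly as follows. The agent object and its methods are hypothetical and simply bundle the helpers sketched in the earlier sections; this is an outline of the flow, not the actual implementation.

```python
def run_turn(agent, user_input):
    """One round of interaction, following the flow described above.

    `agent` is a hypothetical object bundling the helpers sketched earlier
    (handle_input, a ReAct loop with tools, handle_output, and memory).
    """
    text = agent.handle_input(user_input)      # steps 1-2: route and normalize the input
    agent.memory.remember_turn("user", text)   # step 3: update short-term memory
    raw = agent.react(text)                    # step 3: ReAct loop with memory and tools
    agent.memory.remember_turn("agent", raw)   # step 4: save the response to short-term memory
    return agent.handle_output(raw)            # steps 4-5: convert to text, speech, or image
```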