Cambridge-led Chinese team open-sources PandaGPT: the first large foundation model to span six modalities

Source: Xinzhiyuan

**It can see and hear — giving the model different senses with which to understand the world!**

Current large language models such as ChatGPT accept only text as input. Even the upgraded GPT-4 adds only image input and still cannot handle other modalities such as video or audio.

Recently, researchers from the University of Cambridge, the Nara Institute of Science and Technology, and Tencent jointly proposed and open-sourced PandaGPT, a general instruction-following model and the first foundation model able to follow instructions across six modalities of data: image/video, text, audio, depth, thermal, and IMU.

Paper link:

Code link:

Without explicit multimodal supervision, PandaGPT demonstrates strong multimodal capabilities on complex understanding and reasoning tasks, such as generating detailed image descriptions, writing stories inspired by videos, answering questions about audio, and carrying on multi-turn dialogue.

In short, PandaGPT's core innovation is that it can accept inputs from multiple modalities at the same time and naturally combine their semantics, moving beyond traditional single-modality analysis, expanding downstream application scenarios, and taking a step closer to AGI.

Examples

Image-based Q&A:

Image-based multi-turn Q&A:

Video-based Q&A:

Creative writing inspired by images/videos:

Visual reasoning:

Audio reasoning:

Multimodal understanding of image + audio:

Multimodal understanding of video + audio:

Multimodal PandaGPT

Unlike an AI model confined to a computer, humans have multiple senses for understanding the world: they can see a picture and hear the many sounds of nature. If a machine could likewise take in multimodal information, it could solve a wide range of problems more comprehensively.

Most current multimodal research is limited to a single modality, or to text combined with one other modality, and lacks the completeness and complementarity that come from perceiving and understanding multimodal input as a whole.

To give PandaGPT multimodal input capability, the researchers combined ImageBind's multimodal encoder with the large language model Vicuna, a combination that achieves very strong performance on vision- and audio-based instruction-following tasks.
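A minimal sketch of this connection (not the official implementation; the dimensions and module names below are illustrative assumptions): the frozen encoder produces one embedding per multimodal input, a linear layer projects it into the LLM's hidden space, and the result is prepended to the text token embeddings before decoding.

```python
import torch
import torch.nn as nn

IMAGEBIND_DIM = 1024   # assumed size of the ImageBind joint embedding
LLM_HIDDEN = 5120      # assumed hidden size of a Vicuna-13B-style model

class MultimodalPrefix(nn.Module):
    """Bridge a frozen multimodal encoder into a decoder-only LLM (sketch)."""

    def __init__(self, encoder: nn.Module, llm_embed: nn.Embedding):
        super().__init__()
        self.encoder = encoder                              # frozen ImageBind-style encoder
        self.proj = nn.Linear(IMAGEBIND_DIM, LLM_HIDDEN)    # trainable bridge
        self.llm_embed = llm_embed                          # the LLM's token embedding table

    def forward(self, modality_input: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                               # the encoder stays frozen
            z = self.encoder(modality_input)                # (B, IMAGEBIND_DIM)
        prefix = self.proj(z).unsqueeze(1)                  # (B, 1, LLM_HIDDEN)
        text_emb = self.llm_embed(text_ids)                 # (B, T, LLM_HIDDEN)
        # The projected multimodal embedding is prepended to the text tokens,
        # and the combined sequence is fed to the language model.
        return torch.cat([prefix, text_emb], dim=1)
```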

At the same time, to align the feature spaces of the two models, the researchers trained PandaGPT on 160,000 open-source image-language instruction-following examples, where each training instance consists of an image and a multi-turn dialogue containing human instructions and system replies.
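For illustration only, one plausible shape of such a training instance is shown below; the field names are assumptions, not the released data schema.

```python
# Hypothetical layout of a single image + multi-turn dialogue training example.
example = {
    "image": "path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "What is happening in this picture?"},
        {"from": "assistant", "value": "A panda is eating bamboo in a forest."},
        {"from": "human", "value": "Write a short story inspired by it."},
        {"from": "assistant", "value": "Deep in the misty hills, ..."},
    ],
}
```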

To reduce the number of trainable parameters, the researchers trained only the component that connects ImageBind representations to Vicuna, plus additional LoRA weights on Vicuna's attention modules.

Using 8×A100 40GB GPUs, with Vicuna-13B's maximum sequence length set to 400, training takes about 7 hours.
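A hedged sketch of this parameter-efficient setup, using the Hugging Face peft library: everything is frozen except the bridge and the LoRA adapters. The target module names and hyperparameters here are assumptions, not the paper's exact configuration.

```python
from peft import LoraConfig, get_peft_model

def make_trainable(llm, bridge):
    # Freeze all Vicuna weights.
    for p in llm.parameters():
        p.requires_grad = False

    # Add trainable LoRA adapters on the attention projections
    # (r, alpha, dropout, and module names are illustrative guesses).
    lora_cfg = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    llm = get_peft_model(llm, lora_cfg)

    # The linear bridge from ImageBind to the LLM stays trainable.
    for p in bridge.proj.parameters():
        p.requires_grad = True

    return llm, bridge
```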

It is worth noting that the current version of PandaGPT is trained only on aligned image-text data, yet by leveraging the six modalities (image/video, text, audio, depth, thermal, and IMU) inherent in the frozen ImageBind encoder, PandaGPT exhibits emergent zero-shot cross-modal capabilities.
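A toy illustration of why image-only training can transfer: because the frozen ImageBind encoder maps every modality into one joint embedding space, a bridge trained on image embeddings can be reused unchanged for audio, depth, thermal, or IMU embeddings. The encoder output is simulated below with a random vector of an assumed size.

```python
import torch
import torch.nn as nn

proj = nn.Linear(1024, 5120)            # bridge trained only on image-text data (assumed dims)
audio_embedding = torch.randn(1, 1024)  # stand-in for ImageBind's audio embedding
audio_prefix = proj(audio_embedding)    # reused with no audio-specific training
```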

Limitations

Despite PandaGPT's impressive ability to handle multiple modalities and combinations of them, there are several ways it could be further improved:

  1. PandaGPT's training could be enriched by introducing more alignment data, such as pairs of text with other modalities (e.g., audio-text).

  2. The researchers use only a single embedding vector to represent non-text modal content; more fine-grained feature extraction methods, such as cross-modal attention mechanisms, deserve further study and may improve performance.

  3. PandaGPT currently uses multimodal information only as input; in the future it may produce richer multimedia content on the generation side, such as images or audio alongside text responses.

  4. New benchmarks are also needed to evaluate the ability to combine multimodal inputs.

  5. PandaGPT also exhibits several common pitfalls of existing language models, including hallucination, toxicity, and stereotyping.

The researchers also pointed out that PandaGPT is currently only a research prototype and cannot be directly used for real-world applications.

Reference materials:
