Cambridge-led Chinese team open-sources PandaGPT: the first foundation model to span all "six modalities"
Source: Xinzhiyuan
Current large language models such as ChatGPT accept only text as input. Even the upgraded GPT-4 adds only image input, and cannot handle data in other modalities such as video and audio.
Recently, researchers from the University of Cambridge, the Nara Institute of Science and Technology, and Tencent jointly proposed and open-sourced PandaGPT, a general instruction-following model and the first foundation model to follow instructions across six modalities: image/video, text, audio, depth, thermal, and IMU.
Code link:
Without explicit multimodal supervision, PandaGPT demonstrates strong multimodal capabilities on complex understanding and reasoning tasks, such as generating detailed image descriptions, writing stories inspired by videos, answering questions about audio, and carrying on multi-turn dialogue.
Example
Image-based Q&A:
Multimodal PandaGPT
Unlike AI models confined to a computer, humans perceive the world through multiple senses: they can see a picture and hear the many sounds of nature. If machines could likewise take multimodal input, they could solve a wider range of problems more comprehensively.
Most current multimodal research is limited to a single modality, or to combinations of text with one other modality, and lacks the completeness and complementarity that come from perceiving and understanding multimodal input as a whole.
To give PandaGPT multimodal input capability, the researchers combined ImageBind's multimodal encoder with the large language model Vicuna; the combination achieves very strong performance on vision- and audio-based instruction-following tasks.
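The bridging idea can be sketched as a small trainable projection that maps a frozen ImageBind embedding into the language model's token-embedding space, so it can be prepended to the text tokens. The dimensions below (1024 for ImageBind, 4096 for Vicuna's hidden size) and the class name are illustrative assumptions, not PandaGPT's exact implementation:

```python
import torch
import torch.nn as nn

# Assumed dimensions for illustration: a 1024-d ImageBind joint embedding
# and a 4096-d LLM hidden size.
IMAGEBIND_DIM, LLM_DIM = 1024, 4096

class MultimodalPrefix(nn.Module):
    """Project a frozen ImageBind embedding into the LLM's embedding
    space and prepend it to the text-token embeddings."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(IMAGEBIND_DIM, LLM_DIM)

    def forward(self, modality_emb, text_embs):
        # modality_emb: (batch, IMAGEBIND_DIM) from the frozen encoder
        # text_embs:    (batch, seq_len, LLM_DIM) from the LLM embedding table
        prefix = self.proj(modality_emb).unsqueeze(1)  # (batch, 1, LLM_DIM)
        return torch.cat([prefix, text_embs], dim=1)   # (batch, 1+seq_len, LLM_DIM)

# Usage with random stand-ins for real encoder/LLM outputs:
bridge = MultimodalPrefix()
fused = bridge(torch.randn(2, IMAGEBIND_DIM), torch.randn(2, 16, LLM_DIM))
print(fused.shape)  # torch.Size([2, 17, 4096])
```

Because the encoder is frozen, only the projection's weights need to be learned, which keeps the alignment step cheap.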
At the same time, to align the feature spaces of the two models, the researchers trained PandaGPT on 160,000 open-source image-language instruction-following examples, where each training instance consists of an image and a multi-turn dialogue containing human instructions and system replies.
To reduce the number of trainable parameters, the researchers trained only the projection connecting ImageBind representations to Vicuna, plus additional LoRA weights on Vicuna's attention modules.
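A minimal sketch of the LoRA idea: each frozen linear layer in the attention module is augmented with a trainable low-rank update, so only a tiny fraction of parameters is trained. The rank and scaling values below are illustrative defaults, not PandaGPT's actual settings:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update
    (minimal LoRA sketch; rank/alpha are illustrative)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as a no-op update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Wrap a stand-in 4096x4096 attention projection:
layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65536 trainable params vs. ~16.8M in the frozen base
```

Since `lora_b` is zero-initialized, the wrapped layer initially behaves exactly like the frozen original, and training only moves it gradually away from that starting point.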
Notably, the current version of PandaGPT is trained only on aligned image-text data, yet by inheriting the six modalities (image/video, text, audio, depth, thermal, and IMU) bound together in the frozen ImageBind encoder, PandaGPT exhibits emergent zero-shot cross-modal capabilities.
Limitations
Despite PandaGPT's impressive ability to handle multiple modalities and their combinations, it could be improved in several ways:
The training process could be enriched with more alignment data, such as pairings of text with other modalities (e.g., audio-text).
The researchers use only a single embedding vector to represent non-text content; finer-grained feature-extraction methods, such as cross-modal attention mechanisms, may further improve performance.
PandaGPT currently takes multimodal information only as input; in the future it could produce richer multimedia output, such as images or audio responses in addition to text.
New benchmarks are also needed to evaluate the ability to combine multimodal inputs.
PandaGPT also inherits several common shortcomings of existing language models, including hallucination, toxicity, and stereotyping.
The researchers note that PandaGPT is currently only a research prototype and is not ready for direct use in real-world applications.
Reference materials: