MiniGPT4 is open source!

ChatGPT’s ability to converse is something we’ve all experienced, whether it’s in its text generation abilities, coding skills, or contextualized conversation capabilities – each time leaving us in awe.

Do you remember the GPT4 press conference where they showcased its multimodal ability? Input could not only be text, but also text and images.

For example, input: (see image) what happens when a glove falls off?

Output: it falls onto the wooden board and the ball is launched.

MiniGPT4 is open source!

Furthermore, by simply sketching out a website, GPT4 can immediately generate HTML code for the site.

MiniGPT4 is open source!

However, over a month has passed since the conference, and OpenAI has yet to provide the showcased multimodal processing capabilities.

Initially, many thought an official update was necessary to experience this new feature, but to everyone’s surprise, a project called MiniGPT-4 arose.

MiniGPT-4 is an open-source project developed by several doctoral students at King Abdulaziz University of Science and Technology.

Most importantly, it’s completely open source and the end result is demonstrated in this video:

As we can see, MiniGPT-4 can support both text and image input, achieving multimodal input.


Online demo:

The author also provides a web demo for anyone to try:

MiniGPT4 is open source!

MiniGPT-4 was trained on a few large open-source models, which were fine-tuned in two phases: first, by training on 5 million image-text pairs on 4 A100s, then fine-tuning on small high-quality datasets. Training on a single A100 only takes 7 minutes.

MiniGPT4 is open source!

However, there are currently many users using this service, so it’s best to use off-peak hours or deploy a local service.

The local deployment process is not complicated – just follow the official tutorial to configure the environment:

git clone
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4

Then download the pre-trained model:

Finally, run the command:

python –cfg-path eval_configs/minigpt4_eval.yaml

During this process, ensure that you have a network connection and the necessary dependencies downloaded, such as BLIP.

Soon, it’s likely not only to be multimodal input, but also multimodal output.

We’ll be able to input text, images, audio, and video, and AI will generate the required text, images, audio, or even videos based on our needs.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.