Monkey Model Improves Multimodal Model Performance

Vision-language models are artificial intelligence (AI) systems designed to understand and process visual and textual data together. They combine the capabilities of computer vision and natural language processing: trained to interpret images and generate descriptions of them, they enable applications such as image captioning, visual question answering, and text-to-image synthesis. Training on large datasets with powerful neural network architectures lets these models learn the complex relationships between images and text that these tasks require. This opens up possibilities for human-computer interaction and for intelligent systems that communicate more like humans.

Large Multimodal Models (LMMs) are powerful, but they struggle with high-resolution input and detailed scene understanding. Monkey was recently introduced to address these challenges. Monkey is a vision-language model that divides each input image into uniform patches, with each patch matching the size used in the original training of its vision encoder (e.g., 448×448 pixels).

This design allows the model to handle high-resolution images. Monkey employs a two-part strategy: first, it captures more visual detail through higher input resolution; second, it uses a multi-level description generation method to enrich scene-object associations, giving a more comprehensive understanding of the visual data. This approach improves learning from the data by capturing detailed visuals, which in turn makes descriptive text generation more effective.
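The patch division described above can be sketched in a few lines of Python. This is a minimal illustration, not Monkey's actual preprocessing code: the function name `patch_boxes` is hypothetical, and it assumes the image has already been resized so that both sides are multiples of the 448-pixel patch size.

```python
def patch_boxes(width, height, patch=448):
    """Return (left, top, right, bottom) crop boxes that tile the image.

    Hypothetical helper for illustration only. Assumes width and height
    are exact multiples of the patch size.
    """
    if width % patch or height % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return [
        (left, top, left + patch, top + patch)
        for top in range(0, height, patch)    # rows, top to bottom
        for left in range(0, width, patch)    # columns, left to right
    ]


boxes = patch_boxes(1344, 896)
print(len(boxes))   # 6 patches: 3 columns x 2 rows
print(boxes[0])     # (0, 0, 448, 448)
```

Each box could then be passed to an image library's crop function to produce one encoder-sized patch.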

Monkey Architecture Overview

The Overall Monkey Architecture (Image Source)

Let's break down this approach step by step.

Image Processing with Sliding Window

Input Image

Sliding Window

LoRA Integration

LoRA (Low-Rank Adaptation)

Maintaining Structural Information

Global Image Resizing

Processing with Visual Encoder and Resampler

Concurrent Processing

Visual Resampler

Summarizing Visual Information

Obtaining Higher Semantic Representations

Cross-Attention Module

Cross-Attention Mechanism

Balancing Detail and Holistic Understanding

Balanced Approach

This approach improves the model's ability to understand complex images by combining local detail analysis with a global overview, leveraging advanced techniques like LoRA and cross-attention.
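To make the local-plus-global idea concrete, the sketch below counts how many visual tokens would reach the language model for a given image size. It is an illustration under stated assumptions rather than Monkey's code: the helper name `visual_token_count` is hypothetical, and the value of 256 tokens per resampled view is an assumed setting that this article does not state.

```python
def visual_token_count(width, height, patch=448, tokens_per_view=256):
    """Count visual tokens for the local patches plus one global view.

    Hypothetical helper. tokens_per_view=256 is an assumption about how
    many tokens the resampler emits for each patch or resized image.
    """
    if width % patch or height % patch:
        raise ValueError("image sides must be multiples of the patch size")
    local_views = (width // patch) * (height // patch)  # sliding-window patches
    global_views = 1                                    # whole image resized to one patch
    return (local_views + global_views) * tokens_per_view


print(visual_token_count(448, 448))    # 512
print(visual_token_count(1344, 896))   # 1792
```

Because every extra patch adds a fixed block of tokens, the language model's input length limits how far the input resolution can grow.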

A Few Key Points

Resource-Efficient Input Resolution Increase

Maintaining Training Data Distribution

Trainable Patches Advantage

Automatic Multi-Level Description Generation

Advantages of Monkey

High-Resolution Support

Improved Contextual Associations

Performance Enhancements

Overall, Monkey offers a sophisticated way to improve resolution and description generation in LMMs by using existing models more efficiently.

How can I do visual Q&A with Monkey?

To run the Monkey model and experiment with it, first log in to Paperspace and start a notebook or open a terminal. We highly recommend using an A4000 GPU to run the model.

The NVIDIA A4000 GPU is a powerful graphics card that performs well across AI and machine learning applications, including visual question answering (VQA). With its memory and Ampere architecture, the A4000 offers high throughput and efficiency, making it well suited to the complex computations required in VQA tasks.

!nvidia-smi
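If you prefer to check the GPU from Python rather than with the shell command above, a small sketch using only the standard library could look like the following. The helper name `list_gpus` is hypothetical; it simply calls `nvidia-smi` with its CSV query options and returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, total memory' string per GPU, or [] if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []  # nvidia-smi is not installed or not on PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # driver problem or no visible GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No NVIDIA GPU detected")
```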

Setup

Run the code cells below to clone the repository and install the dependencies listed in requirements.txt.

git clone https://github.com/Yuliang-Liu/Monkey.git
cd ./Monkey
pip install -r requirements.txt

We can run the Gradio demo, which is fast and easy to use.

python demo.py

Or follow along with the code.

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "echo840/Monkey-Chat"

# Load the model onto the GPU and switch it to inference mode
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map='cuda', trust_remote_code=True).eval()

# Load the matching tokenizer; pad on the left and use the end-of-document token as padding
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eod_id

The code above uses the Hugging Face Transformers library to load the pre-trained model and tokenizer.

"echo840/Monkey-Chat" is the name of the model checkpoint to load. The from_pretrained call loads the model weights and configuration, and device_map='cuda' places the model on a CUDA-enabled GPU for faster computation.

img_path = '/notebooks/quick_start_pytorch_images/image 2.png'
question = "provide a detailed caption for the image"
query = f'<img>{img_path}</img> {question} Answer: '

# Tokenize the query (image reference plus question) into PyTorch tensors
input_ids = tokenizer(query, return_tensors='pt', padding='longest')
attention_mask = input_ids.attention_mask
input_ids = input_ids.input_ids

# Generate the answer with greedy decoding
pred = model.generate(
    input_ids=input_ids.cuda(),
    attention_mask=attention_mask.cuda(),
    do_sample=False,
    num_beams=1,
    max_new_tokens=512,
    min_new_tokens=1,
    length_penalty=1,
    num_return_sequences=1,
    output_hidden_states=True,
    use_cache=True,
    pad_token_id=tokenizer.eod_id,
    eos_token_id=tokenizer.eod_id,
)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(pred[0][input_ids.size(1):].cpu(), skip_special_tokens=True).strip()
print(response)

This code generates a detailed caption, description, or other output from the prompt query using Monkey. We specify the path where the image is stored and formulate a query string that includes the image reference and the question asking for a caption. The query is then tokenized with the tokenizer, which converts the input text into token IDs.
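Since the query is just a formatted string, it can be convenient to wrap its construction in a small function when asking several questions about the same image. The sketch below follows the query format used in the code above; the function name `build_query` and the second question are hypothetical and not part of the Monkey repository.

```python
def build_query(img_path, question):
    """Format an image path and a question in the prompt layout used above."""
    return f"<img>{img_path}</img> {question} Answer: "


img_path = "/notebooks/quick_start_pytorch_images/image 2.png"
questions = [
    "provide a detailed caption for the image",
    "What text is visible in the image?",  # hypothetical follow-up question
]
for question in questions:
    print(build_query(img_path, question))
```

Each resulting string can be passed to the tokenizer in place of `query`.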

Setting do_sample=False and num_beams=1 selects greedy decoding, which disables sampling and makes the output deterministic. Other parameters, such as max_new_tokens, min_new_tokens, and length_penalty, control the length and nature of the generated sequence. After generation, the output tokens are decoded back into human-readable text, skipping any special tokens, to form the final response, which is a caption describing the image. Finally, we print the generated caption.
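The decoding step slices the output with `pred[0][input_ids.size(1):]` because, for decoder-only models, the generated sequence typically begins with the prompt tokens. The toy sketch below shows the same slicing on plain Python lists; the token IDs are made up for illustration and `new_tokens_only` is a hypothetical helper.

```python
def new_tokens_only(generated_ids, prompt_length):
    """Drop the echoed prompt and keep only the newly generated token IDs."""
    return generated_ids[prompt_length:]


prompt_ids = [101, 7, 42, 9]              # made-up IDs standing in for the tokenized query
generated_ids = prompt_ids + [55, 60, 13]  # prompt followed by the model's new tokens

print(new_tokens_only(generated_ids, len(prompt_ids)))  # [55, 60, 13]
```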

Results

We tried the model on a very basic image, a plain screenshot, and it does fairly well at recognizing what the image shows.

Monkey Output:- The image displays a computer interface with a blue background. The main focus is on a machine with the label "A4000" and the price "$0.76/hr". The machine is equipped with 45 GB RAM and 8 CPUs. There are also several buttons and icons on the interface, including a blue "STOP MACHINE" button and a white "Auto-shutdown in 11 hours" button. The text "45 GB RAM" and "8 CPUs" are also visible. The overall layout suggests a user-friendly interface for managing virtual machines.

Monkey Output:- In the image, a golden retriever is captured in mid-air, leaping towards a yellow tennis ball that is suspended in the air. The dog's body is stretched out, with its front paws reaching for the ball, and its tail is raised high in excitement. The background is a lush green lawn, providing a stark contrast to the dog's white fur. The image is taken from a low angle, making the dog appear larger and more imposing. The dog's focused expression and the dynamic pose suggest a moment of intense playfulness.

Prompt:- Provide a detailed caption for the image

Monkey Output:- This image is a painting of a city skyline at night. The sky is filled with swirling yellow and blue colors, creating a starry effect. In the foreground, there is a tall building with a clock on top. The painting also features a large tree with a yellow light shining on it. The overall effect is one of tranquility and beauty, reminiscent of the famous "Starry Night" painting by Vincent van Gogh.

We are quite impressed by the detailed descriptions and captions, which capture even minute details of the images.

The image below highlights Monkey's capabilities across various VQA tasks. Monkey analyzes questions, identifies key image elements, perceives minute text, reasons about objects, and understands visual charts. The figure also demonstrates Monkey's captioning ability, with accurate object descriptions and summaries.

Monkey's results on various tasks (Image Source)

Comparison Results

In a qualitative analysis, Monkey was compared with GPT-4V and other LMMs on the task of generating detailed captions.

Monkey and GPT-4V identified an "Emporio Armani" store in the background, with Monkey providing additional details, such as a woman in a red coat and black pants carrying a black purse. (Image Source)

Further experiments show that, in many cases, Monkey performs impressively compared with GPT-4V at understanding complex text-based inquiries.

The VQA comparison results in the figure below show that, by scaling up the model size, Monkey achieves significant performance advantages in tasks involving dense text. It not only outperforms QwenVL-Chat [3], LLaVA-1.5 [29], and mPLUG-Owl2 [56] but also achieves promising results compared to GPT-4V [42]. According to the authors, this demonstrates the importance of scaling up model size for improving multimodal large models and validates the effectiveness of their method.

Monkey’s comparison with GPT-4V, QwenVL-Chat, LLaVA-1.5, and mPLUG-Owl2 on the VQA task.

Practical Application

Automated Image Captioning

Assistive Technologies

Interactive Chatbots

Image-Based Search Engines

Conclusion

In this article, we discussed the Monkey chat vision model. The model achieved good results when we tried it on different images, both generating captions and identifying what is in the image. The research claims that the model outperforms various LMMs, including GPT-4V. Its enhanced input resolution also significantly improves performance on document images with dense text, and its use of sliding windows and cross-attention effectively balances local and global image perspectives. However, the method is limited to processing an input image as a maximum of six patches because of the language model's input length constraints, which restricts further expansion of the input resolution.

Despite these limitations, the model shows significant promise in capturing fine details and providing insightful descriptions, particularly for document images with dense text.

We hope you enjoyed reading the article!

References

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Monkey-Chat Github