Hello Paperspace, September 25, 18:02
The Monkey Model Improves Multimodal Model Performance

 

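Monkey is an advanced vision-language model that improves high-resolution image handling by splitting the input image into uniform patches with a sliding-window approach. It combines LoRA with a trainable visual resampler to strengthen detail capture and scene understanding without extensive pretraining. Monkey also uses a multi-level description generation method, drawing on systems such as BLIP2 and PPOCR to produce high-quality captions. It performs well on tasks such as dense-text question answering and proves competitive with GPT-4V.

🔍 Monkey uses a sliding window to split high-resolution images into uniform patches and combines LoRA with a trainable visual resampler, improving detail capture and scene understanding without extensive pretraining. It supports resolutions up to 1344x896.

📊 The model uses a multi-level description generation method that integrates systems such as BLIP2 and PPOCR. Through layered, context-aware understanding it captures a wide range of visual detail and generates high-quality captions, standing out in dense-text question answering in particular.

🔧 Monkey sizes its patches to match the pretraining resolution (e.g., 448x448), so an existing model can process high-resolution input efficiently while the training data distribution is preserved, avoiding the high cost of training from scratch.

🌐 A cross-attention module lets the model focus on the important parts of the image within the features extracted by the visual encoder. Combined with a global view of the image, this balances local detail analysis with understanding of large-scale structure and improves comprehension of complex images.

📈 In practice, Monkey has broad applications in automatic image captioning, assistive technology, and interactive chatbots, where it can noticeably improve the user experience as well as the accuracy and relevance of image search engines.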

Vision-language models are AI systems designed to understand and process visual and textual data together. They combine the capabilities of computer vision and natural language processing: trained to interpret images and generate descriptions of them, they enable applications such as image captioning, visual question answering, and text-to-image synthesis. Training on large datasets with powerful neural network architectures lets these models learn complex relationships between images and text, which in turn lets them perform these tasks. Such systems open up new possibilities for human-computer interaction and for intelligent systems that communicate in a more human-like way.

Large Multimodal Models (LMMs) are powerful, but they struggle with high-resolution input and scene understanding. Monkey was recently introduced to address these challenges. Monkey is a vision-language model that divides each input image into uniform patches, each matching the size used when its vision encoder was originally trained (e.g., 448×448 pixels).

This design allows the model to handle high-resolution images. Monkey employs a two-part strategy: first, it enhances visual capture through higher resolution; second, it uses a multi-level description generation method to enrich scene-object associations, creating a more comprehensive understanding of the visual data. Capturing finer visual detail in this way improves learning from the data and makes the generated descriptive text more effective.
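To make the patching idea concrete, here is a minimal sketch of how a sliding window could enumerate uniform 448×448 patches. The function name `patch_grid` and the requirement that the image is already resized to a multiple of the patch size are assumptions made for illustration; this is not Monkey's actual preprocessing code.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def patch_grid(width: int, height: int, patch: int = 448) -> List[Box]:
    """Return non-overlapping patch boxes covering a width x height image.

    Assumes the image was already resized so that both sides are
    multiples of `patch`; raises ValueError otherwise.
    """
    if width % patch or height % patch:
        raise ValueError("resize the image to a multiple of the patch size first")
    boxes = []
    for top in range(0, height, patch):        # slide the window down
        for left in range(0, width, patch):    # slide the window across
            boxes.append((left, top, left + patch, top + patch))
    return boxes


boxes = patch_grid(1344, 896)
print(len(boxes))   # 6
print(boxes[0])     # (0, 0, 448, 448)
```

A 1344×896 input yields a 3×2 grid, that is, six patches. Each box could then be cropped and passed to the vision encoder, alongside a copy of the whole image resized to the patch size for the global view.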


Monkey Architecture Overview

The Overall Monkey Architecture (Image Source)

Let's break down this approach step by step.

Image Processing with Sliding Window

LoRA Integration

Maintaining Structural Information

Processing with Visual Encoder and Resampler

Cross-Attention Module

Balancing Detail and Holistic Understanding

This approach improves the model's ability to understand complex images by combining local detail analysis with a global overview, leveraging advanced techniques like LoRA and cross-attention.
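As a rough illustration of the LoRA idea mentioned here, the sketch below applies the standard low-rank update, W x + (alpha / r) · B (A x), to a toy 2×2 weight in plain Python. The matrices, the helper names `matvec` and `lora_forward`, and the "tuned" values are all made up for the example. How Monkey wires its adapters into the vision encoder is not shown.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector,
                 alpha: float = 1.0) -> Vector:
    """Compute W x + (alpha / r) * B (A x), the standard LoRA update.

    W is the frozen d_out x d_in weight. A (r x d_in) and B (d_out x r)
    are the small trainable matrices, and r is the adapter rank.
    """
    r = len(A)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (toy 2x2 identity)
A = [[1.0, 1.0]]               # rank-1 down-projection
B_zero = [[0.0], [0.0]]        # B starts at zero, so the adapter is a no-op
B_tuned = [[0.5], [-0.5]]      # hypothetical values after fine-tuning
x = [2.0, 3.0]

print(lora_forward(W, A, B_zero, x))   # [2.0, 3.0]
print(lora_forward(W, A, B_tuned, x))  # [4.5, 0.5]
```

Because only A and B are trained, the pretrained weight W stays untouched, which is what makes this kind of adaptation cheap compared with retraining the encoder.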

Few Key Points

Overall, Monkey offers a sophisticated way to improve resolution and description generation in LMMs by using existing models more efficiently.

How can I do visual Q&A with Monkey?

To run the Monkey model and experiment with it, first log in to Paperspace and start a notebook, or open a terminal. We highly recommend using an A4000 GPU to run the model.

The NVIDIA A4000 GPU is a powerful graphics card known for its performance in AI and machine learning applications, including visual question answering (VQA). With its memory and Ampere architecture, the A4000 offers high throughput and efficiency, making it well suited to the complex computations that VQA tasks require.

!nvidia-smi

Setup


Run the code cells below. They clone the repository and install the packages listed in requirements.txt.

git clone https://github.com/Yuliang-Liu/Monkey.git
cd ./Monkey
pip install -r requirements.txt

We can run the Gradio demo, which is fast and easy to use:

python demo.py

Or follow along with the code below.

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "echo840/Monkey-Chat"

# Load the model onto the GPU and switch it to inference mode
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map='cuda', trust_remote_code=True
).eval()

# Load the matching tokenizer; pad on the left, using the end-of-document token
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eod_id

The code above loads the pre-trained model and tokenizer using the Hugging Face Transformers library.

"echo840/Monkey-Chat" is the name of the model checkpoint we load. The call to from_pretrained loads the model weights and configuration, and device_map='cuda' places the model on a CUDA-enabled GPU for faster computation. Setting trust_remote_code=True allows Transformers to run the custom model code that ships with the checkpoint.

img_path = '/notebooks/quick_start_pytorch_images/image 2.png'
question = "provide a detailed caption for the image"
query = f'<img>{img_path}</img> {question} Answer: '

input_ids = tokenizer(query, return_tensors='pt', padding='longest')
attention_mask = input_ids.attention_mask
input_ids = input_ids.input_ids

pred = model.generate(
    input_ids=input_ids.cuda(),
    attention_mask=attention_mask.cuda(),
    do_sample=False,
    num_beams=1,
    max_new_tokens=512,
    min_new_tokens=1,
    length_penalty=1,
    num_return_sequences=1,
    output_hidden_states=True,
    use_cache=True,
    pad_token_id=tokenizer.eod_id,
    eos_token_id=tokenizer.eod_id,
)
response = tokenizer.decode(pred[0][input_ids.size(1):].cpu(), skip_special_tokens=True).strip()
print(response)

This code uses Monkey to generate a detailed caption, a description, or any other output that the prompt asks for. We specify the path where the image is stored and build a query string that includes the image reference and the question asking for a caption. The query is then tokenized with the tokenizer, which converts the input text into token IDs.

Parameters such as do_sample=False and num_beams=1 ensure deterministic output by disabling sampling. Other parameters like max_new_tokens, min_new_tokens, and length_penalty control the length and nature of the generated sequence. After generation, the output tokens are decoded back into human-readable text, skipping any special tokens, to form the final response, which is a caption describing the image. Finally, we print the generated caption.
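If you want to ask several questions about the same image, the query format can be wrapped in a small helper. This is only a convenience sketch: `build_query`, the example path, and the second question are hypothetical and not part of the Monkey repository. The string layout simply mirrors the query used above.

```python
def build_query(img_path: str, question: str) -> str:
    """Format an image path and a question in the layout used above."""
    return f"<img>{img_path}</img> {question} Answer: "


# Hypothetical path and questions, for illustration only
questions = [
    "provide a detailed caption for the image",
    "What text is visible in the image?",
]
for q in questions:
    print(build_query("/notebooks/example.png", q))
```

Each resulting string can be passed to the tokenizer in place of `query`.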

Results

We tried the model with an extremely basic image of just a screenshot, and it does fairly well in recognizing what the image is.

Monkey Output:- The image displays a computer interface with a blue background. The main focus is on a machine with the label "A4000" and the price "$0.76/hr". The machine is equipped with 45 GB RAM and 8 CPUs. There are also several buttons and icons on the interface, including a blue "STOP MACHINE" button and a white "Auto-shutdown in 11 hours" button. The text "45 GB RAM" and "8 CPUs" are also visible. The overall layout suggests a user-friendly interface for managing virtual machines.

Monkey Output:- In the image, a golden retriever is captured in mid-air, leaping towards a yellow tennis ball that is suspended in the air. The dog's body is stretched out, with its front paws reaching for the ball, and its tail is raised high in excitement. The background is a lush green lawn, providing a stark contrast to the dog's white fur. The image is taken from a low angle, making the dog appear larger and more imposing. The dog's focused expression and the dynamic pose suggest a moment of intense playfulness.

Prompt:- Provide a detailed caption for the image

Monkey Output:- This image is a painting of a city skyline at night. The sky is filled with swirling yellow and blue colors, creating a starry effect. In the foreground, there is a tall building with a clock on top. The painting also features a large tree with a yellow light shining on it. The overall effect is one of tranquility and beauty, reminiscent of the famous "Starry Night" painting by Vincent van Gogh.

We are quite impressed by the detailed descriptions and captions that provide even the minutest details of the image. The AI-generated caption is truly remarkable!

The image below highlights Monkey's capabilities across various VQA tasks. Monkey analyzes questions, identifies key image elements, perceives minute text, reasons about objects, and understands visual charts. The figure also demonstrates Monkey's impressive captioning ability, accurately describing objects and providing summaries.

Monkey's results on various tasks (Image Source)

Comparison Results

In the qualitative analysis, Monkey was compared with GPT-4V and other LMMs on the task of generating detailed captions.

Further experiments showed that, in many cases, Monkey performs impressively compared with GPT-4V when it comes to understanding complex text-based inquiries.

The VQA comparison in the figure below shows that, by scaling up the model size, Monkey achieves significant performance advantages in tasks involving dense text. It not only outperforms QwenVL-Chat [3], LLaVA-1.5 [29], and mPLUG-Owl2 [56] but also achieves promising results compared to GPT-4V [42]. According to the authors, this demonstrates the importance of scaling up model size for multimodal large models and validates the method's effectiveness in enhancing their performance.

Monkey’s comparison with GPT-4V, QwenVL-Chat, LLaVA-1.5, and mPLUG-Owl2 on VQA task.

Practical Application

Conclusion

In this article, we discussed the Monkey chat vision model. The model achieved good results when we tried it on different images, both for generating captions and for understanding what is in an image. The research claims that the model outperforms various LMMs, including GPT-4V on some tasks. Its enhanced input resolution also significantly improves performance on document images with dense text, and its use of sliding windows and cross-attention effectively balances local and global image perspectives. However, the method can process an input image as at most six patches because of the language model's input length constraints, which restricts further expansion of the input resolution.

Despite these limitations, the model shows significant promise in capturing fine details and providing insightful descriptions, particularly for document images with dense text.

We hope you enjoyed reading the article!

