
HKU and ByteDance propose a new paradigm for multimodal large models that, like humans, perceives before it reasons

Mon, May 27 2024 08:25 AM EST

Currently, multimodal large models (MLLMs) have demonstrated strong cognitive understanding capabilities across various visual tasks.

However, most multimodal large models are limited to one-way image understanding, making it difficult to map the understood content back to the image.

For instance, while the model can easily identify objects in an image, it struggles to accurately pinpoint the location of these objects within the image.

The lack of localization ability directly hinders the applications of multimodal large models in downstream fields such as image editing, autonomous driving, and robot control.

To address this issue, researchers from the University of Hong Kong and ByteDance's commercialization team have proposed a new paradigm called Groma: it introduces regional image encoding to give multimodal large models the ability to perceive and localize objects.

With the integration of localization, Groma can directly associate text content with image regions, significantly improving the interactivity and directionality of conversations.

Core Idea

How to endow multimodal large models with the ability to locate objects, and even to associate text content with specific image regions (i.e., grounding), is currently a major research focus.

A common approach is to fine-tune the large language model to directly output object coordinates as text (a sketch of this output format follows the list below). However, this method has several limitations:

  1. Large language models pre-trained on text lack spatial understanding capabilities, making it difficult to precisely locate objects with only a small amount of fine-tuning data.

  2. The localization task requires high-resolution input images, but increasing the resolution significantly increases the computational load of multimodal large models.

  3. The output format of large language models is not suitable for fine-grained localization tasks such as segmentation.
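For illustration, the snippet below shows how such coordinate-as-text outputs are typically parsed back into pixel boxes. The token format and the 0-1000 normalization grid are hypothetical placeholders, not the convention of any specific model.

```python
# Hedged illustration of the coordinate-as-text approach: the LLM is fine-tuned
# to emit box corners as plain text, which the application parses back into
# pixel coordinates. Format and normalization grid are assumptions.
import re

def parse_boxes(answer: str, img_w: int, img_h: int):
    """Extract boxes written as <box>(x1,y1),(x2,y2)</box> with coords on a 0-1000 grid."""
    boxes = []
    for x1, y1, x2, y2 in re.findall(r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>", answer):
        # Rescale from the normalized grid back to pixel space.
        boxes.append((int(x1) * img_w / 1000, int(y1) * img_h / 1000,
                      int(x2) * img_w / 1000, int(y2) * img_h / 1000))
    return boxes

print(parse_boxes("The cat sits at <box>(120,300),(480,760)</box>.", 640, 480))
```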

Considering these factors, Groma proposes transferring the localization task to the vision tokenizer of the multimodal large model: the vision tokenizer discovers and locates potential objects, and the large language model then identifies them.

At the same time, this design fully leverages the spatial understanding capability already present in the vision tokenizer, without needing external expert models (such as SAM) to assist with localization, thus avoiding the redundancy of adding extra models.

Specifically, on top of global image encoding, Groma introduces region encoding to achieve localization: a Region Proposer first locates potential objects, and a Region Encoder then encodes each localized region into a region token.

The large language model can then determine the semantics of each region token and achieve visually grounded conversation by inserting region tokens into its output, creating a hyperlink-like effect.

Similarly, regions specified by the user can be encoded into region tokens through the Region Encoder and inserted into the user's instructions, allowing the model to focus on the specified regions and generate targeted responses.

To enhance the robustness and accuracy of localization, Groma uses over 8M images of data (including SA1B) to pre-train the Region Proposer. As a result, the generated proposals cover not only common objects but also object parts and broader background elements.
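The following is a minimal sketch of the dataflow described above, assuming hypothetical module names, tensor shapes, and a simple average-pooling Region Encoder; it is not the authors' implementation.

```python
# Minimal sketch of the Groma-style dataflow: proposed boxes are turned into
# region tokens that the LLM can read in prompts or emit in answers.
# All names, shapes, and the pooling scheme are illustrative assumptions.
import torch
import torch.nn as nn

class RegionEncoder(nn.Module):
    """Pools the image feature map inside each proposed box into one region token."""
    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feat_map: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # feat_map: (C, H, W); boxes: (N, 4) as normalized (x1, y1, x2, y2).
        C, H, W = feat_map.shape
        tokens = []
        for x1, y1, x2, y2 in boxes:
            xs, xe = int(x1 * W), max(int(x2 * W), int(x1 * W) + 1)
            ys, ye = int(y1 * H), max(int(y2 * H), int(y1 * H) + 1)
            tokens.append(feat_map[:, ys:ye, xs:xe].mean(dim=(1, 2)))  # average-pool the region
        return self.proj(torch.stack(tokens))  # (N, llm_dim) region tokens

# Usage: region tokens are interleaved with text embeddings so the LLM can
# "point" at a region by emitting its token, hyperlink-style, or attend to a
# user-specified region inserted into the instruction.
feat_map = torch.randn(256, 32, 32)           # vision tokenizer feature map (assumed shape)
boxes = torch.tensor([[0.1, 0.2, 0.5, 0.8]])  # one proposal from the Region Proposer
region_tokens = RegionEncoder(256, 4096)(feat_map, boxes)
print(region_tokens.shape)                    # torch.Size([1, 4096])
```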

Furthermore, thanks to its modular design, Groma can feed high-resolution feature maps to the Region Proposer/Encoder while feeding low-resolution feature maps to the large language model, reducing computational cost without sacrificing localization performance.
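A hedged sketch of that dual-resolution idea, with purely illustrative shapes and a simple average-pooling downsampler:

```python
# Keep a high-resolution feature map for the Region Proposer / Region Encoder,
# but downsample before feeding the LLM so its image-token count (and attention
# cost) stays low. Shapes and the pooling choice are assumptions for illustration.
import torch
import torch.nn.functional as F

high_res = torch.randn(1, 256, 64, 64)                 # used for proposing/encoding regions
low_res = F.avg_pool2d(high_res, kernel_size=4)        # (1, 256, 16, 16) for the LLM
llm_image_tokens = low_res.flatten(2).transpose(1, 2)  # (1, 256 tokens, 256 channels)
print(high_res.shape[-1] ** 2, "vs", llm_image_tokens.shape[1], "image tokens fed to the LLM")
```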

Experimental Results

Groma demonstrates performance superior to MiniGPT-v2 and Qwen-VL on traditional grounding benchmarks.

On the general VQA benchmark for multimodal large models (LLaVA-COCO), Groma also validated its conversational and reasoning abilities.

In visual comparisons, Groma shows higher recall and fewer hallucinations.

In addition, Groma supports referential dialogue, which integrates conversational and localization capabilities, as well as grounded chat.

Thanks to the powerful cognitive reasoning abilities of large language models, multimodal large models excel in visual understanding tasks.

However, some traditional visual tasks, such as detection, segmentation, and depth estimation, rely more on visual perception abilities, which are precisely what large language models lack.

Groma offers a new solution to this issue by decoupling perception and cognition, with the vision tokenizer handling perception and the large language model handling cognition.

This perception-before-cognition approach not only aligns better with the human visual process but also avoids the computational cost of retraining the large language model.

On May 15, ByteDance unveiled its self-developed Doubao large model, which provides multimodal capabilities, supports more than 50 businesses including the Doubao APP, Coze, and Jimeng, and is open to enterprise customers through Volcano Engine to help businesses improve efficiency and accelerate intelligent innovation. The Doubao APP has become the largest AIGC application in the Chinese market. ByteDance continues to increase its investment in top talent and cutting-edge technologies, taking part in industry-leading technical challenges and breakthroughs.

Project Website: https://groma-mllm.github.io
Paper Link: https://arxiv.org/abs/2404.13013
Open Source Code: https://github.com/FoundationVision/Groma