
Multimodal Chain-of-Thought (CoT) Architecture Released and Open-Sourced | From Xiamen University & Tencent Youtu

Thu, May 30 2024 08:29 AM EST

Multimodal large models now come with their own chain of thought (CoT)!

The Xiamen University & Tencent Youtu team has introduced a decision-aware multimodal chain-of-thought architecture called "Cantor," which significantly boosts performance without any additional training.

On ScienceQA, Cantor based on GPT-3.5 achieved an accuracy of 82.39%, a 4.08% improvement over the GPT-3.5-based chain-of-thought method.

On the more challenging MathVista benchmark, Cantor based on Gemini demonstrated a 5.9% increase in accuracy over the original Gemini.

The Cantor paper is now available on arXiv, and the code has been open-sourced. (Links are provided at the end of this article.)

A Chain of Thought Exclusive to Multimodal Models

The Chain of Thought (CoT) is a widely used prompting technique that significantly enhances the reasoning abilities of large models by incorporating intermediate reasoning steps.
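As a rough illustration (not an example from the paper), CoT prompting can be as simple as appending a step-by-step instruction to the question; the model call below is only a placeholder:

```python
# Minimal, hypothetical sketch of chain-of-thought (CoT) prompting.
def cot_prompt(question: str) -> str:
    # The appended instruction elicits intermediate reasoning steps
    # before the final answer, which is the core idea of CoT.
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer."
    )

# Usage (call_llm stands in for any text-generation API and is not defined here):
# answer = call_llm(cot_prompt("A beaker holds 250 mL and is 40% full. How much liquid does it contain?"))
```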

However, in visual reasoning tasks, models not only need to grasp the overall logic behind the questions but also need to analyze specific details by integrating image information.

This is where the Multimodal Chain of Thought comes into play.

Existing multimodal Chain of Thought approaches typically decompose questions into multiple related subtasks and sequentially invoke various external tools for processing.

However, because visual information is insufficiently incorporated and the perceptual tools involved are low-level, this paradigm faces two problems: potential "decision illusions" during decision-making, and low-level perceptual tools that fail to provide high-level reasoning information.

The Cantor architecture endows Multimodal Large Language Models (MLLM) or Large Language Models (LLM) with the coordinating ability akin to a lead singer in a choir:

Initially, it enables MLLM or LLM to simultaneously process visual and textual contexts, leading to comprehensive understanding and decision-making awareness, thus avoiding decision illusions.

Subsequently, specific subtasks are assigned to "experts" role-played by the MLLM to obtain high-level cognitive information that further assists reasoning.

Figure (a) illustrates the impact of different visual information on decision-making:

  • Without visual context, asking GPT-3.5 about the maximum scale of a beaker would result in a decision illusion due to the lack of image information, prompting the need for more information.
  • Cantor based on LLM introduces visual context through captions, avoiding decision illusions and proposing reasonable solutions.
  • Cantor based on MLLM enhances visual context with images, further improving decision quality by making subtasks more specific.

Figure (b) compares different visual tools:

  • For subtasks related to object detection, traditional methods using low-level perceptual tools (such as detectors) can only obtain basic data (such as coordinates). These low-level clues require further integration to derive useful information, increasing the burden of reasoning.
  • High-level cognitive experts played by MLLM can directly obtain advanced reasoning information (such as the relative quantity relationships of objects), aiding in subsequent reasoning processes.

Decision Generation + Execution in Two Steps

Cantor's architecture consists of two main steps: decision generation and execution.

The former involves analyzing and decoupling the problem, combining various expert module features to generate reasonable decisions.

The latter calls upon various expert modules to execute subtasks, synthesizing information for consideration and generating the final answer.
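Putting the two steps together, a rough, hypothetical sketch of the pipeline might look like the following (this is not the released code; the model calls and the subtask parser are placeholders, and the four expert modules are introduced below):

```python
from typing import Callable, Dict, List

def cantor(
    question: str,
    caption: str,
    image: object,                                          # the raw image, passed to the MLLM
    decision_llm: Callable[[str], str],                     # placeholder text-generation call
    expert_mllm: Callable[[object, str], str],              # placeholder image+text call
    parse_subtasks: Callable[[str], List[Dict[str, str]]],  # hypothetical decision parser
) -> str:
    # Step 1: decision generation -- analyze the problem with full visual/textual
    # context and allocate subtasks to the expert modules, with reasons.
    decision = decision_llm(
        f"Question: {question}\nImage caption: {caption}\n"
        "Analyze the principle behind the question and assign subtasks to the expert "
        "modules (TextIntel Extract, ObjectQuant Locator, VisionIQ Analyst, "
        "ChartSense Expert), giving a reason for each assignment."
    )

    # Step 2a: modular execution -- the MLLM role-plays each selected expert
    # to answer its subtask and return high-level cognitive information.
    sub_answers = [
        expert_mllm(image, f"You are the {t['module']} module. {t['instruction']}")
        for t in parse_subtasks(decision)
    ]

    # Step 2b: summary execution -- integrate the decision and sub-answers,
    # reason from the principle first, then give the final answer.
    return decision_llm(
        f"Question: {question}\nDecision: {decision}\nSub-answers: {sub_answers}\n"
        "Reason from the underlying principle, remain critical of the sub-answers, "
        "and then give the final answer."
    )
```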

The team specifically designed four expert modules:

  • TextIntel Extract: This module selectively extracts text from images as required, particularly useful for images containing a mix of text and graphic elements.
  • ObjectQuant Locator: This module identifies and locates objects in images, with advantages in comparing quantities and recognizing spatial relationships.
  • VisionIQ Analyst: This module processes and interprets visual data, adept at handling queries related to image content and analyzing images.
  • ChartSense Expert: This module specializes in analyzing and interpreting charts: extracting data points, understanding trends, and identifying key components such as titles, axes, labels, and legends.

In the decision-generation step, the MLLM or LLM serves as the decision generator, acting as the decision brain: it first analyzes the problem, combines the characteristics of the various expert modules, allocates subtasks, and gives the reason for each allocation.
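As noted later in the article, these experts are role-played by the MLLM rather than implemented as separate tools, so one plausible way to picture them (a hypothetical sketch, not the paper's actual prompts) is as a set of role prompts:

```python
# Hypothetical role prompts for the four expert modules; the wording used in the
# paper may differ.
EXPERT_ROLES = {
    "TextIntel Extract": "You selectively extract the requested text from the image, "
                         "including text mixed with graphic elements.",
    "ObjectQuant Locator": "You identify and locate objects in the image, comparing "
                           "quantities and spatial relationships.",
    "VisionIQ Analyst": "You analyze the visual content of the image and answer "
                        "queries about what it shows.",
    "ChartSense Expert": "You interpret charts: data points, trends, titles, axes, "
                         "labels, and legends.",
}

def expert_prompt(module: str, instruction: str) -> str:
    # Role-play prompt handed to the MLLM together with the image.
    return f"{EXPERT_ROLES[module]}\nSubtask: {instruction}"
```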

For example, when comparing the temperatures of two samples, Cantor first analyzes the relationship between temperature and particle kinetic energy, the latter being expressed as (1/2)mv². Combining the image information with the characteristics of the expert modules, it assigns subtasks to TextIntel Extract and ObjectQuant Locator (a sketch of what this allocation might look like follows the list below):

  1. Extract the mass and velocity of each particle in samples A and B.
  2. Which sample has more particles?
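A simplified, hypothetical rendering of such a decision might look like this (the field names and wording are illustrative, not the paper's actual output format):

```python
# Hypothetical decision output for the temperature-comparison question.
decision = {
    "principle": "Temperature reflects the average kinetic energy of particles, "
                 "which is (1/2) * m * v**2.",
    "subtasks": [
        {
            "module": "TextIntel Extract",
            "instruction": "Extract the mass and velocity of each particle in "
                           "samples A and B.",
            "reason": "The values appear as text within the image.",
        },
        {
            "module": "ObjectQuant Locator",
            "instruction": "Which sample has more particles?",
            "reason": "This requires counting and comparing objects in the image.",
        },
    ],
}
```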

This step has the following characteristics:

Initially, LLM or MLLM is used as a decision generator, acting as the decision brain.

Next, the team provides multiple expert modules to complete various types of subtasks, acting as the decision limbs. This integration ensures that decision-making is both comprehensive and detailed, making full use of the advantages of each module.

Subsequently, based on insights from the principle analysis, the decision generator customizes tasks for selected expert modules, enhancing Cantor's efficiency and performance through dynamic task allocation.

Execution is divided into modular execution and summary execution:

  1. Modular Execution:

During this stage, Cantor calls various expert modules to complete the subtasks assigned in the decision-making stage to obtain supplementary information.

It is worth noting that the team uses only the MLLM to role-play the various expert modules, obtaining high-level cognitive information (such as quantity relationships and relative positions) to assist reasoning.

For example, for the subtasks assigned in the previous step, TextIntel Extract and ObjectQuant Locator return the following answers:

  1. Sample A: mass 44 u, velocity 1,400 m/s. Sample B: mass 46 u, velocity 1,400 m/s.

  2. The number of particles in both samples is the same.

  2. Summary Execution:

During this stage, Cantor consolidates the information of subtasks and sub-answers, combines them with basic principles, and generates the final answer.

This stage has three key aspects. First, the MLLM or LLM is prompted to act as a knowledgeable, information-integrating answer generator, so that it can make basic judgments of its own while effectively integrating the supplementary information.

Second, for interpretability and to strengthen its reasoning, the model is required to show its thought process: it must first state the rationale for its answer and only then produce the corresponding option.

Lastly, Cantor is required to stay rational and critical, rather than relying solely on the information obtained from module execution.
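As a concrete illustration of what this summary step amounts to for the temperature example above (a worked calculation, not output taken from the paper): with equal velocities and equal particle counts, the sample with heavier particles has the higher average kinetic energy and is therefore hotter.

```python
# Worked arithmetic for the summary step of the temperature example.
# Average kinetic energy per particle: KE = 0.5 * m * v**2 (atomic mass units
# are fine here because only the comparison matters).
ke_a = 0.5 * 44 * 1400 ** 2   # sample A: 43,120,000 u*(m/s)^2
ke_b = 0.5 * 46 * 1400 ** 2   # sample B: 45,080,000 u*(m/s)^2

# Same velocity and same particle count, so sample B's heavier particles carry
# more kinetic energy -> sample B has the higher temperature.
print("hotter sample:", "B" if ke_b > ke_a else "A")
```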

Surpassing Fine-Tuning Methods Without Training

Cantor comes in two versions: Cantor (GPT-3.5) uses GPT-3.5 as both the decision generator and the answer generator, while Cantor (Gemini) uses Gemini Pro 1.0 in the same roles.

The team conducted experiments on two complex visual reasoning datasets, ScienceQA and MathVista.

Results from the experiments on ScienceQA are as follows:

Using GPT-3.5 as the base language model for both decision-making and answering, Cantor achieved an accuracy of 82.39%, a 4.08% improvement over GPT-3.5 with chain-of-thought (CoT) prompting.

By utilizing Gemini as the decision and answer generator, Cantor achieved an accuracy of 84.96%, surpassing all zero-shot methods significantly, even outperforming fine-tuning methods like UnifiedQA (CoT) and MM-CoT.

The team further examined Cantor's performance on the IMG category of ScienceQA, in which every question includes image context.

Cantor based on GPT-3.5 significantly outperforms the baseline across these tasks, even surpassing some well-known MLLMs such as SPHINX and LLaVA-1.5.

Cantor (Gemini) also shows a significant improvement over the baseline.

MathVista is a challenging dataset that combines a variety of mathematical reasoning tasks with visual tasks.

The table above compares the performance of different methods. From general visual questions to specialized mathematical problems, Cantor significantly outperforms the baseline in almost all types of tasks.

This indicates that correct decision-making and modular expertise can inspire fine-grained, in-depth visual understanding and compositional reasoning abilities.

It is noteworthy that Cantor (GPT-3.5) even surpasses GPT-4 with CoT and PoT prompting.

The team further demonstrates specific examples comparing Gemini with Cantor (Gemini):

Through task allocation and by having Gemini role-play the expert modules, Cantor handles questions that were previously difficult and arrives at the correct answers.

It is worth noting that even on some questions that Gemini answers correctly, its reasoning process is flawed, an issue Cantor does not exhibit.

Paper link: https://arxiv.org/abs/2404.16033
Project link: https://ggg0919.github.io/cantor/