Multimodal large models now come with their own chain-of-thought (CoT) reasoning!
A team from Xiamen University and Tencent Youtu has introduced a decision-aware multimodal chain-of-thought architecture called "Cantor," which significantly boosts performance without any additional training. On ScienceQA, Cantor built on GPT-3.5 achieved an accuracy of 82.39%, a 4.08% improvement over the GPT-3.5-based chain-of-thought baseline.
On the more challenging MathVista benchmark, Cantor built on Gemini improved accuracy by 5.9% over the original Gemini model.
The Cantor paper is now available on arXiv, and the code has been open-sourced. (Link provided at the end of the document)
The Chain of Thought (CoT) is a widely used prompting technique that significantly enhances the reasoning abilities of large models by incorporating intermediate reasoning steps.
However, in visual reasoning tasks, models not only need to grasp the overall logic behind the questions but also need to analyze specific details by integrating image information.
This is where the Multimodal Chain of Thought comes into play.
Existing multimodal Chain of Thought approaches typically decompose questions into multiple related subtasks and sequentially invoke various external tools for processing.
However, because the decision step often lacks adequate visual information, this paradigm risks "decision illusions" during task decomposition; moreover, the low-level perceptual tools it invokes fail to provide the higher-level reasoning information the task requires.
The Cantor architecture endows Multimodal Large Language Models (MLLM) or Large Language Models (LLM) with the coordinating ability akin to a lead singer in a choir:
First, it has the MLLM or LLM process the visual and textual context together, producing a comprehensive understanding and decision awareness, thus avoiding decision illusions.
Then, specific subtasks are assigned to "experts" role-played by the MLLM, which supply higher-level cognitive information to further assist reasoning. Figure (a) illustrates the impact of different visual information on decision-making:
Figure (b) compares different visual tools:
Decision Generation + Execution in Two Steps
Cantor's architecture consists of two main steps: decision generation and execution.
The former involves analyzing and decoupling the problem, combining various expert module features to generate reasonable decisions.
The latter calls upon various expert modules to execute subtasks, synthesizing information for consideration and generating the final answer.
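The two-step flow described above can be sketched in a few lines of Python. This is a minimal, hypothetical outline, not Cantor's actual implementation: the function and prompt wording are invented for illustration, and `call_mllm` stands in for whatever MLLM API is used; the real prompts are in the open-source repository.

```python
# Minimal sketch of Cantor's two-step pipeline (hypothetical names and
# prompt wording; the actual prompts live in the open-source repo).

def cantor_answer(question, image, call_mllm):
    """call_mllm(prompt, image) -> str is a stand-in for any MLLM API."""
    # Step 1: decision generation -- the model sees both image and
    # question, analyzes the underlying principle, and assigns subtasks
    # to expert modules (returned here as "Module: subtask" lines).
    decision = call_mllm(
        "Analyze this question, state the principle it relies on, and "
        "assign subtasks to expert modules (e.g. TextIntel Extractor, "
        "ObjectQuant Locator), one 'Module: subtask' per line.\n"
        f"Question: {question}", image)

    # Step 2a: modular execution -- the same model role-plays each
    # expert module to answer its assigned subtask.
    sub_answers = []
    for line in decision.splitlines():
        if ":" in line:
            module, subtask = line.split(":", 1)
            sub_answers.append(call_mllm(
                f"You are the {module.strip()} expert. {subtask.strip()}",
                image))

    # Step 2b: summary execution -- synthesize the principle, subtasks,
    # and sub-answers into the final answer.
    return call_mllm(
        "Using the analysis and sub-answers below, reason step by step "
        "and give the final answer.\n"
        f"Analysis: {decision}\nSub-answers: {sub_answers}\n"
        f"Question: {question}", image)
```

Because the model is passed in as a callable, the same skeleton works whether the decision generator is GPT-3.5, Gemini, or any other (M)LLM.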
The team specifically designed four expert modules:
For example, when comparing the temperatures of two solutions, Cantor first analyzes the relationship between a particle's temperature and its kinetic energy, the expression for a particle's kinetic energy being (1/2)mv^2. Combining the image information with each expert module's characteristics, it assigns subtasks to the TextIntel Extractor and the ObjectQuant Locator:
This step has the following characteristics:
First, an LLM or MLLM is used as the decision generator, acting as the decision-making "brain."
Next, the team provides multiple expert modules to complete the various types of subtasks, acting as the decision-making "limbs." This pairing ensures decisions are both comprehensive and detailed, making full use of each module's strengths.
Subsequently, based on insights from the principle analysis, the decision generator customizes tasks for selected expert modules, enhancing Cantor's efficiency and performance through dynamic task allocation.
Execution is divided into modular execution and summary execution:
During this stage, Cantor calls various expert modules to complete the subtasks assigned in the decision-making stage to obtain supplementary information.
It is worth noting that the team uses only the MLLM itself, role-playing the various expert modules, to obtain higher-level cognitive information that assists reasoning (such as object quantities and relative positions), rather than relying on low-level external perceptual tools.
For example, for the subtasks assigned in the previous step, TextIntel Extractor and ObjectQuant Locator provide the following answers:
Sample A: Mass 44u, velocity 1,400m/s. Sample B: Mass 46u, velocity 1,400m/s.
The number of particles in both samples is the same.
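With these sub-answers in hand, the question reduces to a direct comparison: equal particle counts and equal speeds mean the average kinetic energy, and hence the temperature, scales with particle mass. A quick check using the values from the example (`kinetic_energy` is an illustrative helper, not part of Cantor):

```python
# Average kinetic energy per particle: KE = (1/2) * m * v^2.
# Masses in atomic mass units (u), velocities in m/s, as in the example.
def kinetic_energy(mass_u, velocity):
    U = 1.660539e-27  # kg per atomic mass unit
    return 0.5 * mass_u * U * velocity**2

ke_a = kinetic_energy(44, 1400)  # sample A: 44 u at 1,400 m/s
ke_b = kinetic_energy(46, 1400)  # sample B: 46 u at 1,400 m/s

# Same velocity, larger mass => higher average kinetic energy,
# hence sample B is the hotter one.
assert ke_b > ke_a
```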
Summary Execution:
During this stage, Cantor consolidates the information of subtasks and sub-answers, combines them with basic principles, and generates the final answer.
This involves three key aspects. First, the MLLM or LLM is prompted to act as a knowledgeable, information-integrating answer generator, ensuring it has the expertise to make basic judgments while integrating the gathered information effectively.
Second, for interpretability and to strengthen its reasoning, the model is required to first generate the rationale behind the answer and only then produce the corresponding option.
Lastly, Cantor is required to remain rational and critical, not relying solely on the information returned by module execution.
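The three requirements above can be captured in a single prompt template. The sketch below is hypothetical wording for illustration only; Cantor's actual summary-execution prompts are in its repository.

```python
# Hypothetical summary-execution prompt reflecting the three requirements:
# (1) knowledgeable answer generator, (2) rationale before option,
# (3) critical treatment of module outputs. Wording is illustrative.
def build_summary_prompt(question, options, subtasks, sub_answers):
    pairs = "\n".join(f"- {t}: {a}" for t, a in zip(subtasks, sub_answers))
    return (
        # 1) knowledgeable, information-integrating answer generator
        "You are a knowledgeable answer generator. Integrate the "
        "sub-task results below with your own knowledge.\n"
        # 2) rationale first, then the option, for interpretability
        "First state the basic principle behind your answer, then give "
        "the corresponding option.\n"
        # 3) stay critical: module outputs may be wrong
        "Be critical: do not rely solely on the sub-task results if "
        "they conflict with basic principles.\n"
        f"Question: {question}\nOptions: {options}\n"
        f"Sub-task results:\n{pairs}"
    )
```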
Cantor is divided into two versions: Cantor (GPT-3.5) uses GPT-3.5 as the decision generator and answer generator, while Cantor (Gemini) uses Gemini Pro 1.0 for the same roles.
The team conducted experiments on two complex visual reasoning datasets, ScienceQA and MathVista.
Results on ScienceQA are as follows: with GPT-3.5 as the base language model for decision-making and answering, Cantor achieved an accuracy of 82.39%, a 4.08% improvement over GPT-3.5 with chain-of-thought (CoT) prompting.
Using Gemini as the decision and answer generator, Cantor achieved an accuracy of 84.96%, significantly surpassing all zero-shot methods and even outperforming fine-tuned methods such as UnifiedQA (CoT) and MM-CoT.
The team further demonstrated the performance of Cantor in the IMG category of ScienceQA, where all questions in this category include image context. It can be seen that Cantor based on GPT-3.5 significantly outperforms the baseline on various tasks, even surpassing some well-known MLLMs such as SPHINX and LLaVA-1.5.
Cantor (Gemini) likewise shows a significant improvement over its baseline.

MathVista is a challenging dataset that integrates various mathematical reasoning tasks with visual tasks.
The table above compares the performance of different methods. From general visual questions to specialized mathematical problems, Cantor significantly outperforms the baseline in almost all types of tasks.
This indicates that correct decision-making and modular expertise can inspire fine-grained, in-depth visual understanding and compositional reasoning abilities.
It is noteworthy that Cantor (GPT-3.5) even surpasses GPT-4 with CoT and PoT (Program of Thought) prompting.
The team further demonstrates specific examples comparing Gemini with Cantor (Gemini):
It can be seen that, by allocating subtasks and having Gemini role-play the expert modules, Cantor handles problems the base model previously struggled with and arrives at the correct answers.
It is worth noting that even when Gemini answers some questions correctly, its reasoning process is flawed; Cantor does not exhibit this issue.
Paper link: https://arxiv.org/abs/2404.16033
Project link: https://ggg0919.github.io/cantor/