
DouBao Large Model Reveals Evaluation Scores, 19% Improvement Over Previous Generation "Skylark"

Meng Jia Mon, May 27 2024 07:40 PM EST

The DouBao large model was recently released at the Volcano Engine Power Conference. Its ultra-low pricing set off a wave of price cuts among large models, and its capabilities have also drawn industry attention.

In a Volcano Engine product document, the DouBao model team disclosed the results of an internal test: across 11 mainstream public evaluation sets, including MMLU, BBH, GSM8K, and HumanEval, DouBao-pro-4k achieved a total score of 76.8 points. That is a 19% improvement over the previous-generation model Skylark2, which scored 64.5 points, and higher than the other domestic models tested in the same period.
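The 19% figure can be checked from the two reported totals. The snippet below is a minimal sketch of that arithmetic only; the function name is hypothetical, and the article does not say how the 11 evaluation sets are aggregated into a total score.

```python
def relative_improvement(new_score: float, old_score: float) -> float:
    """Return the gain of new_score over old_score as a percentage of old_score."""
    if old_score <= 0:
        raise ValueError("old_score must be positive")
    return (new_score - old_score) / old_score * 100


# Totals as reported in the article (internal test, 11 public evaluation sets).
doubao_pro_4k = 76.8
skylark2 = 64.5

gain = relative_improvement(doubao_pro_4k, skylark2)
print(f"{gain:.1f}%")  # prints "19.1%", which the article rounds to 19%
```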

The evaluation was completed in May of this year and covered nine domestic large language models, among them DouBao's general model-pro and Skylark2. Apart from Skylark2, the models tested were the latest advanced versions released by their vendors, accessed through API calls.

Image: Internal Testing Results of DouBao Model Team

The results show that on HumanEval and MBPP, the evaluation sets that assess coding ability, DouBao improved by around 50% over the previous-generation model. On the evaluation sets focusing on professional knowledge and instruction following, it improved by 33% and 24% respectively, making it the top-scoring domestic model.

The DouBao model also performed well on mathematical ability, language comprehension, and the comprehensive evaluation sets CMMLU and C-Eval, ranking in the top three. Against the 76.8 points that DouBao's general model-pro scored across the 11 public evaluation sets, GPT-4 scored 80.1 points on the same sets according to OpenAI's published test scores, so it still holds a lead over domestic models.

Reportedly, the DouBao model launched only on May 15th and has not yet been included in third-party testing. Third-party evaluation organizations are expected to publish their results for the model over the next one to two months. The AI chat assistant "DouBao," which shares its name with the model, has officially reported 26 million monthly active users, and anyone can try it for free.

Previously, the Zhìyuán Research Institute released an evaluation report covering 91 language models from around the world. In its subjective evaluation of Chinese language capabilities, Skylark2 ranked first, ahead of GPT-4.

Image: Language Model Evaluation Results from the Zhìyuán Research Institute (model versions prior to April 20th)