Wenxin Yiyan Significantly Leads in Multiple Metrics, Tsinghua Official Report Reveals

Kew Tue, Apr 30 2024 08:11 PM EST

The Foundation Model Research Center of Tsinghua University, in collaboration with the Zhongguancun Laboratory, recently released the March 2024 edition of the "SuperBench Comprehensive Capability Evaluation Report" for large models. The evaluation framework, named SuperBench, assessed 14 representative models from China and abroad. The results indicate that Wenxin Yiyan 4.0 performed exceptionally well, approaching the level of top international models and narrowing the gap with them, which solidifies its status as a leading model in China.

For example, in the human alignment assessment, Wenxin Yiyan 4.0 ranked first among Chinese models. It led the field in the Chinese reasoning and Chinese language evaluations, significantly outpacing other models. In Chinese comprehension, Wenxin Yiyan 4.0 had a clear advantage, leading the second-ranked GLM-4 by 0.41 points, while the GPT-4 series models ranked in the middle to lower range and trailed the leader by more than 1 point.

In the mathematics portion of the semantic understanding evaluation, Wenxin Yiyan 4.0 tied for first place globally with Claude-3; the GPT-4 series ranked fourth and fifth, while other models scored around 55 points, significantly behind the leaders. In the reading comprehension portion, Wenxin Yiyan 4.0 surpassed GPT-4 Turbo, Claude-3, and GLM-4 to take the top spot.

In the safety evaluation, the criterion companies value most when selecting a large model, the domestic model Wenxin Yiyan 4.0 achieved the highest score (89.1 points), surpassing top international models such as the GPT-4 series and Claude-3, with Claude-3 ranking only fourth. It is worth noting that Wenxin Yiyan not only excels in technical capability but also leads in real-world deployment. Since its launch on March 16 last year, Wenxin Yiyan has surpassed 200 million users, with daily API calls exceeding 200 million.

In the "Battle of the Giants" of 2023, domestic large models competed fiercely to determine the true leader. Although many model performance rankings exist in China and abroad, their quality varies, and the rankings they produce differ significantly. When consulting these rankings, it is important to rely on assessments from authoritative institutions and prestigious universities in order to make a well-founded choice of large model.
