
Dark Horse! Large Model Arena Rankings Updated, a Domestic Player Enters the Global Top 10 for the First Time

Thu, May 23 2024 08:17 AM EST

In the fiercely competitive large model arena, the rankings saw a surprise update today:

Yi-Large, the hundred-billion-parameter closed-source model from the domestic large model company 01.AI, has climbed to seventh place on the overall leaderboard, making it the highest-ranked domestic model on the list.

Its score is nearly on par with GPT-4-0125-preview.

At the same time, GLM-4-0116 from Zhipu AI, the Tsinghua-affiliated domestic large model company, also made the overall leaderboard, placing 15th. These results come from real blind-test votes cast by more than 1.17 million users worldwide.

Moreover, the large model arena recently revised its rules: once a model's identity is revealed, further votes on it no longer count, which eliminates the possibility of score manipulation.

Of the six models ranked above Yi-Large, four are OpenAI GPT models, one is Google's Gemini, and one is Anthropic's Claude.

Dr. Kai-Fu Lee, founder and CEO of 01.AI, said that LMSYS provides an impartial third-party platform that is widely recognized, even among competitors.

He noted that 01.AI's team size, parameter count, and GPU compute are all smaller than those of the models ranked above it.

Yi-Large Soars in Rankings

The arena's official account also shared more of Yi-Large's results:

In the Chinese-language category, the two domestic models Yi-Large and GLM-4 both performed well.

Among them, Yi-Large stood out, tying with GPT-4o for first place. The confidence intervals for each model's strength are shown in the accompanying figure. Notably, to improve the overall quality of arena queries, LMSYS has introduced a deduplication mechanism and published a separate leaderboard computed after redundant queries are removed.

The new mechanism filters out excessively repetitive user prompts, such as a bare "hello" submitted over and over, since that kind of redundancy can distort the rankings.

LMSYS has stated publicly that this deduplicated leaderboard will become the default overall ranking going forward.
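LMSYS has not published the exact filtering rules, but the idea can be sketched as capping how many times an identical prompt counts toward the rankings. In the minimal Python sketch below, the record format, the normalization, and the max_repeats cap are all illustrative assumptions:

```python
from collections import Counter

def dedupe_prompts(battles, max_repeats=3):
    """Keep at most `max_repeats` votes per identical prompt.

    `battles` is a list of dicts with a "prompt" key. The cap and the
    lowercase/strip normalization are assumptions for illustration,
    not LMSYS's published rules.
    """
    seen = Counter()
    kept = []
    for battle in battles:
        key = battle["prompt"].strip().lower()
        seen[key] += 1
        if seen[key] <= max_repeats:
            kept.append(battle)
    return kept

# A flood of "hello" votes is trimmed to the cap; unique prompts survive.
votes = [{"prompt": "hello"}] * 10 + [{"prompt": "Prove sqrt(2) is irrational"}]
print(len(dedupe_prompts(votes)))  # 4 -> three "hello" votes plus the proof
```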

In the current deduplicated overall ranking, Yi-Large's Elo score improves further, tying for fourth place with Claude 3 Opus and GPT-4-0125-preview.

By way of background: the Elo rating system is grounded in statistics and is the internationally accepted standard for measuring competitive strength. Each participant starts with a baseline rating that is adjusted after every match; when a lower-rated player beats a higher-rated one, the winner gains more points than usual, and vice versa.
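As a rough illustration of that update rule (a classic Elo formula with an assumed K-factor of 32, not necessarily LMSYS's exact computation), consider this Python sketch:

```python
def expected_score(r_a, r_b):
    """Expected win probability of player A against player B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, score_a, k=32):
    """Return both players' updated ratings after one match.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    The K-factor of 32 is an assumed constant for illustration.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# The underdog gains far more from a win than the favorite does:
print(update_elo(1000, 1200, 1.0))  # approx (1024.3, 1175.7)
print(update_elo(1200, 1000, 1.0))  # approx (1207.7, 992.3)
```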

LMSYS adopted the Elo system to keep the arena's rankings as objective and fair as possible. Yi-Large also shines in the category-specific rankings.

LMSYS maintains three specialized leaderboards, for coding ability, longer queries, and the newly introduced "hard prompts." Known for their rigor and difficulty, these are among the most demanding public blind tests available for large models today.

On the coding-ability leaderboard, Yi-Large's Elo score surpasses Anthropic's flagship model Claude 3 Opus, placing just behind GPT-4o and tied for second with GPT-4-Turbo and GPT-4. Yi-Large also ties for second worldwide on the longer-query leaderboard, alongside GPT-4-Turbo, GPT-4, and Claude 3 Opus. The hard-prompts category was added in today's leaderboard refresh in response to requests from the LMSYS community.

These prompts come from user submissions in the arena and are specifically selected to be more complex, more demanding, and more rigorous.

The category was added because LMSYS believes such prompts test how the latest language models hold up against genuinely challenging tasks.

On this leaderboard, Yi-Large is again tied for second place with GPT-4-Turbo, GPT-4, and Claude 3 Opus. The model behind these results, Yi-Large, is a closed-source model that 01.AI released just a week ago.

In the official evaluation results, Yi-Large ranked first on both the HumanEval coding benchmark and the MATH reasoning benchmark, ahead of top models such as GPT-4, Claude 3 Sonnet, Gemini 1.5 Pro, and LLaMA3-70B-Instruct. Reportedly, the next step is Yi-XLarge, a model built on a Mixture-of-Experts (MoE) architecture that has already begun training.

The Arena of Large Models

The large model arena, formally known as the Chatbot Arena, has become the proving ground for today's top large models.

Foreign models such as Google's Bard, OpenAI's mysterious gpt2-chatbot (not to be confused with GPT-2), and Mistral AI's Mistral Large have all charged into battle on this platform.

Many domestic players have also been gradually putting their own models to the test.

Last year, AI luminary Andrej Karpathy praised the large model arena as "very awesome." After the release of GPT-4o, OpenAI CEO Sam Altman also quote-posted the arena's blind-test results, declaring them amazing. The Large Model Systems Organization (LMSYS Org) behind the arena was founded by students and faculty from UC Berkeley, UC San Diego, and Carnegie Mellon University; despite its academic roots, it focuses on research projects closely aligned with industry.

LMSYS not only develops large language models but also provides the industry with various datasets, including MT-Bench, its benchmark for instruction-following ability that is widely treated as authoritative. In addition, it has built distributed systems that accelerate large model training and inference, supplying the compute needed to run the live arena. The arena itself draws inspiration from the side-by-side comparison evaluations of the search engine era.

It first pairs all submitted models at random and presents them to users in anonymized form.

Without knowing the model names, users enter a prompt, and the two dueling models, Model A and Model B, generate their answers side by side. Users then vote by choosing one of the options below the results:

Model A is better / Model B is better / It's a tie / Both are bad.

After submitting a vote, users can move on to the next round. Overall, the arena's evaluation process combines direct user voting, blind testing, large-scale participation, and dynamically updated scoring to keep the results objective and professional.
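Putting the pieces together, here is a minimal, self-contained Python sketch of how such a stream of anonymous pairwise votes could be turned into a leaderboard; the battle log, model names, starting rating of 1000, and the treatment of "both are bad" as a tie are all illustrative assumptions:

```python
def elo_update(r_a, r_b, s_a, k=32):
    """One Elo step; s_a is 1.0 (A wins), 0.5 (tie), or 0.0 (B wins)."""
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Hypothetical battle log: (model_a, model_b, outcome), where outcome is
# "a", "b", or "tie" ("both are bad" is folded into "tie" here).
battles = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "b"),
]

SCORES = {"a": 1.0, "b": 0.0, "tie": 0.5}
ratings = {}  # model name -> current Elo rating

for a, b, outcome in battles:
    r_a = ratings.setdefault(a, 1000.0)
    r_b = ratings.setdefault(b, 1000.0)
    ratings[a], ratings[b] = elo_update(r_a, r_b, SCORES[outcome])

# Print the resulting leaderboard, best rating first.
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.1f}")
```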

According to official public data, a total of 44 models are competing in this updated round of the arena.

They include open-source standouts such as Llama3-70B as well as closed-source models from major companies and startups around the world.

Finally, here is a win-rate heat map covering every large model currently in the arena. Go check how your favorite model is doing! (doge)
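For the curious, a win-rate matrix like the one behind that heat map can be derived from raw vote records along these lines; the DataFrame contents and column names below are illustrative, not LMSYS's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical vote records; "winner" is "a", "b", or "tie".
df = pd.DataFrame([
    {"model_a": "model-x", "model_b": "model-y", "winner": "a"},
    {"model_a": "model-x", "model_b": "model-y", "winner": "b"},
    {"model_a": "model-y", "model_b": "model-z", "winner": "a"},
    {"model_a": "model-x", "model_b": "model-z", "winner": "tie"},
])

# Keep only decisive battles, then record who won and who lost each one.
decisive = df[df["winner"].isin(["a", "b"])].copy()
decisive["won"] = np.where(decisive["winner"] == "a",
                           decisive["model_a"], decisive["model_b"])
decisive["lost"] = np.where(decisive["winner"] == "a",
                            decisive["model_b"], decisive["model_a"])

# Win counts per ordered pair, total games per pair, then win rates.
wins = pd.crosstab(decisive["won"], decisive["lost"])
games = wins.add(wins.T, fill_value=0)
win_rate = (wins / games).round(2)  # row model's win rate vs column model
print(win_rate)
```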

Large model arena blind-test platform: https://arena.lmsys.org/
Large model arena leaderboard (continuously updated): https://chat.lmsys.org/?leaderboard