
Chinese Large Models Take the Lead: Yi-Large Ties GPT-4o for First Place on the Chinese-Language Leaderboard of the Global Blind-Test Rankings

Thu, May 23 2024 07:58 AM EST

Synced

Synced Editorial Team

Last week, a mysterious model named "im-also-a-good-gpt2-chatbot" suddenly appeared in the large model arena Chatbot Arena, surpassing top models from international giants such as GPT-4-Turbo, Gemini 1.5 Pro, Claude 3 Opus, and Llama-3-70B. OpenAI subsequently lifted the veil: "im-also-a-good-gpt2-chatbot" turned out to be a test version of GPT-4o, and after GPT-4o's release OpenAI CEO Sam Altman personally reposted and quoted the results of the LMSYS arena blind test. The Chatbot Arena, released by the Large Model Systems Organization (LMSYS Org), has become the battleground of choice for international giants such as OpenAI, Anthropic, Google, and Meta; with its open, public-voting evaluation format, it has emerged as one of the most transparent and scientific ways to compare models as the large model race enters its second year.

In the latest rankings, released a week later, a dark-horse story much like that of "im-also-a-good-gpt2-chatbot" has played out again. The model climbing rapidly this time is Yi-Large, a hundred-billion-parameter closed-source model submitted by the Chinese large model company 01.AI (Zero One Wanwu).

In the latest LMSYS blind-test arena rankings, 01.AI's newest hundred-billion-parameter model, Yi-Large, ranks 7th globally and first among Chinese large models, surpassing Llama-3-70B and Claude 3 Sonnet. On the Chinese-language leaderboard it is tied for first place with GPT-4o.

01.AI has thus become the only Chinese large model company with a model of its own in the overall top ten, where the GPT series occupies four spots. Ranked by institution, 01.AI sits immediately behind OpenAI, Google, and Anthropic, officially joining the ranks of top international large model companies on this openly recognized gold-standard leaderboard.

The blind-test results from the LMSYS Chatbot Arena, refreshed on May 20, 2024, reflect real votes from more than 1.17 million users worldwide to date.

It is worth mentioning that, to improve the overall quality of queries in the Chatbot Arena, LMSYS has introduced a de-duplication mechanism and published a leaderboard computed after redundant queries are removed. The mechanism filters out excessively repeated user prompts, such as endlessly duplicated greetings like "hello", which would otherwise distort the rankings. LMSYS has stated publicly that the de-duplicated leaderboard will become the default in the future.
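To illustrate the idea behind this de-duplication, the minimal Python sketch below simply caps how many times any single prompt can count toward the leaderboard. This is not LMSYS's actual pipeline; the record format and the threshold are assumptions made for the example.

```python
from collections import Counter

def deduplicate_battles(battles, max_repeats=3):
    """Keep at most `max_repeats` votes per identical prompt.

    Illustrative sketch only: each battle is assumed to be a dict like
    {"prompt": ..., "model_a": ..., "model_b": ..., "winner": ...}. Prompts
    that occur excessively often (e.g. thousands of bare "hello" greetings)
    contribute only their first few occurrences, so they cannot skew the
    rankings.
    """
    seen = Counter()
    kept = []
    for battle in battles:
        key = battle["prompt"].strip().lower()  # normalize so "Hello" == "hello"
        seen[key] += 1
        if seen[key] <= max_repeats:
            kept.append(battle)
    return kept
```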

On the overall leaderboard computed after de-duplication, Yi-Large's Elo score improves further, tying for fourth place with Claude 3 Opus and GPT-4-0125-preview.

LMSYS Chinese Ranking

GPT-4o and Yi-Large tied for first place

Of particular interest to Chinese readers: among domestic large model makers, Zhipu's GLM-4, Alibaba's Qwen-Max and Qwen 1.5, and 01.AI's Yi-Large and Yi-34B-Chat all took part in this round of blind testing. Beyond the overall rankings, LMSYS has added per-language evaluations in English, Chinese, and French, signaling a focus on the diversity of large models worldwide.

Yi-Large took the lead on the Chinese-language leaderboard, tying for first place with OpenAI's newly announced GPT-4o, which had been hailed as the strongest model on Earth for barely a week. Qwen-Max and GLM-4 also performed strongly in the Chinese rankings.

"The Most Mind-Boggling" Public Evaluation

Yi-Large Ranks Second Globally

Yi-Large also shines in the category leaderboards. The three evaluations of coding ability, long queries, and the newly introduced Hard Prompts are targeted leaderboards provided by LMSYS, known for their professionalism and high difficulty, and together they make up a public blind test of large models that can fairly be called "the most mind-boggling."

On the coding leaderboard, Yi-Large's Elo score surpasses Anthropic's flagship model Claude 3 Opus, ranking only below GPT-4o and tied for second place with GPT-4-Turbo and GPT-4. On the Long Query leaderboard, Yi-Large likewise ranks second globally, tied with GPT-4-Turbo, GPT-4, and Claude 3 Opus. Hard Prompts is a new category added by LMSYS in response to community requests. It consists of prompts submitted by Arena users that are deliberately more complex, demanding, and rigorous; LMSYS believes such prompts test how the latest language models perform when faced with genuinely challenging tasks.

On this leaderboard, Yi-Large's ability to handle hard prompts is confirmed as well, tying for second place with GPT-4-Turbo, GPT-4, and Claude 3 Opus.

LMSYS Chatbot Arena

The compass of the post-benchmark era

Providing an objective and fair evaluation of large models has long been a hotly debated topic in the industry. To post impressive scores on fixed question banks, all sorts of ranking-boosting tricks have emerged: mixing evaluation benchmarks directly into the training set, pitting unaligned models against aligned ones, and so on. For anyone trying to understand the true capabilities of large models, the result has been a confusion of conflicting claims that leaves even the investors behind these models unsure what to believe.

After the chaotic waves of large model evaluation in 2023, the industry now places far greater emphasis on the professionalism and objectivity of evaluation sets. The Chatbot Arena released by LMSYS Org, with its innovative "arena" format and rigorous testing team, has become a globally recognized benchmark, to the point that OpenAI anonymously pre-released and pre-tested GPT-4o on LMSYS before its official launch.

Among overseas tech executives, not only Sam Altman but also Jeff Dean, Chief Scientist of Google DeepMind, has cited LMSYS Chatbot Arena ranking data to back up the performance of the Bard products. OpenAI co-founder Andrej Karpathy has even publicly called Chatbot Arena "awesome." Submitting flagship models to LMSYS as soon as they are released shows how seriously the top overseas companies take the Chatbot Arena, a respect that rests on LMSYS's standing as an authoritative research organization and on its innovative ranking mechanism.

Public information shows that LMSYS Org is an open research organization founded by students and faculty from the University of California, Berkeley, the University of California, San Diego, and Carnegie Mellon University. Despite its academic roots, LMSYS's research projects are closely tied to industry: the organization develops large language models of its own, provides the industry with datasets (such as MT-Bench, an authoritative benchmark for instruction following) and evaluation tools, builds distributed systems to accelerate large model training and inference, and supplies the computing power needed to run the live online model competitions.

In format, Chatbot Arena borrows the side-by-side comparative evaluation approach of the search engine era. All "contestant" models submitted for evaluation are first randomly paired and presented anonymously. Real users then enter their own prompts on the blind-test platform at https://arena.lmsys.org/ without knowing which models they are talking to; the two anonymous models, labeled only Model A and Model B, each generate a response, and the user votes for one of four options beneath the results: Model A is better, Model B is better, it's a tie, or both are bad. After submitting a vote, the user can move on to the next round. By crowdsourcing real users for live online blind tests and anonymous voting, Chatbot Arena reduces the influence of bias and largely closes off the possibility of gaming the rankings against a fixed test set, increasing the objectivity of the final scores. After cleaning and anonymizing the data, Chatbot Arena also makes all user voting data public. Thanks to this "real-user blind-test voting" mechanism, Chatbot Arena has been hailed as the most user-centric Olympics of the large model industry.
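A rough Python sketch of one such blind-test round might look like the following. This is only an illustration of the flow described above; the `models` mapping, the `get_user_vote` callback, and the vote labels are assumptions, not the platform's actual interface.

```python
import random

VOTE_OPTIONS = ("model_a_better", "model_b_better", "tie", "both_bad")

def run_blind_battle(models, prompt, get_user_vote):
    """Run one Arena-style blind comparison (illustrative sketch).

    `models` maps model names to callables that turn a prompt into a reply;
    `get_user_vote` shows the two anonymous replies to a real user and returns
    one of VOTE_OPTIONS. The model names stay hidden until after the vote,
    which is what makes the test "blind".
    """
    name_a, name_b = random.sample(list(models), 2)  # anonymous random pairing
    reply_a = models[name_a](prompt)
    reply_b = models[name_b](prompt)
    vote = get_user_vote(reply_a, reply_b)           # the user never sees the names
    assert vote in VOTE_OPTIONS
    return {"model_a": name_a, "model_b": name_b, "prompt": prompt, "winner": vote}
```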

After collecting real user votes, LMSYS Chatbot Arena applies the Elo rating system to quantify model performance, continually refining the scoring mechanism so that it reflects participants' abilities as fairly as possible.

The Elo rating system, devised by the Hungarian-American physicist Arpad Elo, is an authoritative, statistically grounded method for quantifying and comparing the competitive level of players. As the internationally accepted standard for rating competitive skill, it plays a central role in chess, Go, soccer, basketball, esports, and more.

In simple terms, each participant in the Elo system starts from a baseline rating. After every game, the ratings are adjusted according to the outcome: the system computes each player's expected probability of winning from the current ratings, and when a lower-rated player beats a higher-rated one, the lower-rated player gains more points than for an expected win, and vice versa. By adopting the Elo rating system, LMSYS Chatbot Arena protects the objectivity and fairness of its rankings to the greatest extent possible.

The evaluation process of Chatbot Arena thus combines direct user voting, blind testing, large-scale participation, and a dynamically updated rating mechanism. Together these factors ensure that the evaluation is objective, authoritative, and professional. Such a method more accurately reflects how large models perform in practical applications, giving the industry a reliable reference standard.
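The rating update described above fits in a few lines of Python. The sketch below is the textbook Elo formulation with an assumed K-factor of 32; the Arena leaderboard itself is computed with a more elaborate statistical fit over all votes, so treat this only as an illustration of the principle.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Return updated ratings after one game.

    `score_a` is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie. An upset win
    (a low-rated player beating a high-rated one) moves the ratings more than
    an expected win, which is the behavior described in the text.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: a 1240-rated model beats a 1287-rated one and gains about 18 points.
print(update_elo(1240, 1287, score_a=1.0))
```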

Yi-Large competes with international top models

Topping the domestic large model blind test

A total of 44 models took part in this round of the Chatbot Arena, including top open-source models such as Llama3-70B as well as proprietary models from the major companies. According to the latest Elo ratings, GPT-4o leads the pack with a score of 1287, followed by a second tier of GPT-4-Turbo, Gemini 1.5 Pro, Claude 3 Opus, and Yi-Large at around 1240 points. Behind them, Bard (Gemini Pro), Llama-3-70b-Instruct, and Claude 3 Sonnet drop sharply to around 1200 points.

It is worth noting that the top six models all belong to the overseas giants OpenAI, Google, and Anthropic, which makes 01.AI the fourth-ranked institution globally. Models such as GPT-4 and Gemini 1.5 Pro are reported to operate at trillion-parameter scale, and the rest of the leaders reach several hundred billion parameters, whereas Yi-Large, at only around a hundred billion parameters, climbed to seventh place worldwide soon after its release on May 13, standing alongside the flagship models of the overseas tech giants. In the LMSYS Chatbot Arena rankings as of May 21, Alibaba's Qwen-Max scored 1186, ranking 12th, and Zhipu AI's GLM-4 scored 1175, ranking 15th.
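As a rough illustration of what these Elo gaps mean, the standard Elo expectation formula E_A = 1 / (1 + 10^((R_B - R_A) / 400)) puts GPT-4o's expected win rate against a 1240-rated second-tier model at 1 / (1 + 10^((1240 - 1287) / 400)) ≈ 0.57, i.e. roughly 57 head-to-head wins out of 100 decisive votes; a 47-point lead is meaningful but far from a blowout. (The published leaderboard also accounts for ties and confidence intervals, so this back-of-the-envelope figure is only indicative.)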

As large models rush into commercial applications, their actual performance urgently needs to be tested rigorously in concrete application scenarios to demonstrate real value and potential. The superficial, for-show evaluation methods of the past are no longer meaningful. For the large model industry to develop healthily, the field as a whole must push toward a more objective, fair, and authoritative evaluation system.

In this context, an evaluation platform like Chatbot Arena, which can provide real user feedback, employ blind testing mechanisms to prevent result manipulation, and continuously update its scoring system, is particularly crucial. It not only offers impartial evaluations for models but also ensures the authenticity and authority of evaluation results through extensive user participation.

Whether for the sake of iterating their own models or for their long-term reputation, large model manufacturers should actively take part in authoritative evaluation platforms like Chatbot Arena and prove their products' competitiveness through real user feedback and professional evaluation mechanisms.

Doing so not only enhances a manufacturer's brand image and market position but also promotes the healthy development of the whole industry, driving technological innovation and product optimization. Conversely, manufacturers who cling to for-show evaluations and ignore real-world performance will only widen the gap between their models' capabilities and market demand, making it ever harder to compete in a fierce market.