
Large Model Blind Test Rankings Updated! Yi-Large Ranks in the Global Top Seven; Kai-Fu Lee Discusses the Impact of Price Wars

Wed, May 22 2024 07:52 AM EST

Author: ZeR0 | Editor: Mo Ying

On May 21, Zhidongxi reported that the blind test results of the well-known large model arena LMSYS Chatbot Arena were updated today. Yi-Large, the hundred-billion-parameter closed-source model from Chinese large model unicorn 01.AI (Zero One Wanwu), ranked seventh in the latest overall global rankings and first among Chinese models, surpassing Llama-3-70B and Claude 3 Sonnet. In the Chinese-language ranking, it is tied for first place with GPT-4o.

The Chatbot Arena rankings, published by the third-party non-profit organization LMSYS Org, are blind test results based on more than 1.17 million real user votes worldwide. A total of 44 models took part in this round, including the open-source Llama-3-70B and proprietary models from the major players.

Chatbot Arena's evaluation process, from direct user voting and blind testing to large-scale voting and a dynamically updated rating mechanism, is designed to keep the evaluations objective, authoritative, and professional, and to reflect more accurately how large models perform in real applications.

Last week, a test version of OpenAI's GPT-4o entered the Chatbot Arena leaderboard under the alias "im-also-a-good-gpt2-chatbot", surpassing a host of leading international models such as GPT-4-Turbo, Gemini 1.5 Pro, Claude 3 Opus, and Llama-3-70B. After GPT-4o's release, OpenAI CEO Sam Altman personally shared and cited the blind test results from the LMSYS arena.

According to the latest Elo ratings, GPT-4o leads with a score of 1287, while GPT-4-Turbo, Gemini 1.5 Pro, Claude 3 Opus, Yi-Large, and others form a second tier with scores around 1240.

The top six models come from the major players OpenAI, Google, and Anthropic. GPT-4, Gemini 1.5 Pro, and the other flagships have trillion-scale parameter counts, while the rest are in the hundreds of billions. 01.AI is the only Chinese large model company with an in-house model in the top ten, placing it fourth as a vendor behind OpenAI, Google, and Anthropic. Yi-Large, with a parameter count of only about a hundred billion, ranks seventh with a score of 1236.

Below them, Bard (Gemini Pro), Llama-3-70B-Instruct, and Claude 3 Sonnet drop to around 1200 points. Alibaba's Qwen-Max has an Elo score of 1186, ranking twelfth, while Zhipu AI's GLM-4 has an Elo score of 1175, ranking fifteenth.

To improve the overall quality of Chatbot Arena queries, LMSYS has introduced a de-duplication mechanism and released a leaderboard computed after redundant queries are removed. The mechanism is meant to eliminate excessively repeated user prompts, such as endless copies of "hello", which would otherwise distort the rankings. LMSYS has stated publicly that the de-duplicated leaderboard will become the default in the future.
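The article does not describe LMSYS's actual de-duplication algorithm, so the following Python sketch is only an illustrative guess at how over-repeated prompts might be capped; the normalize helper, the max_repeats threshold, and the battle-record fields are all hypothetical.

```python
from collections import Counter

def normalize(prompt: str) -> str:
    """Collapse case and whitespace so trivial variants of a prompt match."""
    return " ".join(prompt.lower().split())

def dedup_battles(battles: list[dict], max_repeats: int = 100) -> list[dict]:
    """Keep at most max_repeats battles per distinct prompt, so a flood of
    identical prompts like "hello" stops dominating the ratings.
    Each battle is assumed to be a dict with a "prompt" key (hypothetical)."""
    seen: Counter[str] = Counter()
    kept = []
    for battle in battles:
        key = normalize(battle["prompt"])
        seen[key] += 1
        if seen[key] <= max_repeats:
            kept.append(battle)
    return kept
```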

In the de-duplicated leaderboard, Yi-Large's Elo score improves further, tying for fourth place with Claude 3 Opus and GPT-4-0125-preview.

LMSYS Chatbot Arena blind test public voting address: https://arena.lmsys.org/
LMSYS Chatbot Arena leaderboard (rolling updates): https://chat.lmsys.org/?leaderboard

  1. Squeezing More Value out of a GPU: Kai-Fu Lee on the Impact of the Large Model Price War

According to Dr. Kai-Fu Lee, CEO of 01.AI, Yi-Large achieved these results even though the model is less than one-tenth the size of Google's and OpenAI's and was trained with less than one-tenth of their GPU compute. A year ago, 01.AI's GPU compute was only 5% of Google's and OpenAI's; and while those top foreign AI teams number in the thousands, 01.AI's combined model and infrastructure team is fewer than a hundred people.

"We can extract more value from the same GPU, which is an important reason why we can achieve these results today," Li Kaifu said. "If only evaluating trillion-parameter models, at least on this leaderboard, we are world number one. We are proud of these points. A year ago, we were 7 to 10 years behind OpenAI and Google in starting large model research and development; now, we are only about 6 months behind them, a significant reduction in the gap."

Why the rapid catch-up? Dr. Huang Wenhao, head of model training at 01.AI, said that essentially every decision 01.AI made in model training has proved correct, including spending a long time raising data quality, following scaling laws, and continuing to scale up while improving the data further.

At the same time, 01.AI places great weight on infrastructure. Algorithms and infrastructure are co-designed to extract the most from the available compute, and the team deliberately blends engineering, infrastructure, and algorithm talent.

Kai-Fu Lee said 01.AI wants to be the best in China at every model size, from the smallest to the largest. Smaller models may be released in the future, each aiming to lead its size class and to excel in code, Chinese, English, and many other areas. Smaller models open up plenty of opportunities in lightweight applications, and 01.AI's strategy is to "not let any slip by."

He also addressed the recent price war over large model APIs. Kai-Fu Lee believes 01.AI's pricing is very reasonable, and the company is working hard to cut prices further.

"Is there a big difference between spending a dozen yuan or a few yuan for 1 million tokens? For large and difficult applications, I think we are the inevitable choice," he said. Zero One Million's API spans both domestic and international markets, and they are confident that it is a model with good performance and reasonable cost-effectiveness globally. "So far, the performance we just announced is definitely the best cost-effective option domestically. People may use a thousand tokens or a million tokens, you can calculate it yourself."

He believes the industry's inference costs will inevitably fall to one-tenth of their previous level each year, and that usage of large model APIs is still very low today. If more people can afford to use them, that is very good news.

Kai-Fu Lee believes large model companies will not adopt irrational lose-lose strategies. Technology matters most; if the technology is not good, a business sustained purely by money will not last. If the Chinese market nonetheless embraces a scorched-earth approach of losing everything rather than letting others win, 01.AI will turn to foreign markets.

Huang Wenhao shared that 01.AI has not run into a data shortage and sees plenty of untapped potential in data. Recent findings in multimodal data suggest the usable data volume can grow by another one to two orders of magnitude, and ideas originating from the 01.AI team, such as using "weak data", contribute to training quality and data diversity.

  2. Yi-Large: Tied for First Place in the Chinese Ranking with GPT-4o, Ranked Second in Challenging-Task Evaluations

Among domestic large model makers, Zhipu's GLM-4, Alibaba's Qwen-Max and Qwen 1.5, and 01.AI's Yi-Large and Yi-34B-Chat all took part in this round of blind testing.

In addition to the overall rankings, LMSYS's language category now includes evaluations in three languages: English, Chinese, and French. In the Chinese sub-ranking, Yi-Large and OpenAI's GPT-4o are tied for first place, with Qwen-Max and GLM-4 also ranking high.

Programming ability, longer queries, and the newly introduced "Hard Prompts" are the three specialized rankings offered by LMSYS, known for their professionalism and high difficulty.

On the Coding leaderboard, Yi-Large's Elo score surpasses Anthropic's flagship Claude 3 Opus, trailing only GPT-4o and tied for second place with GPT-4-Turbo and GPT-4. Yi-Large also ranks second globally on the Longer Query leaderboard, alongside GPT-4-Turbo, GPT-4, and Claude 3 Opus. The Hard Prompts category consists of prompts submitted by Arena users that are specifically designed to be more complex, demanding, and rigorous.

LMSYS believes these prompts test how the latest language models perform when facing challenging tasks. In this ranking, Yi-Large is tied for second place with GPT-4-Turbo, GPT-4, and Claude 3 Opus.

  3. Entering the Post-Benchmark Era: Blind Testing Provides a More Impartial Evaluation of Large Models

How to evaluate large models objectively and fairly has long been debated across the industry. After last year's chaotic wave of large model evaluations, the industry now places far greater emphasis on the professionalism and objectivity of evaluation sets.

Platforms like Chatbot Arena gather real user feedback, use blind testing to prevent manipulated results, and continuously update their scoring system. This not only gives models a fair evaluation but also, through broad user participation, secures the authenticity and authority of the results.

Chatbot Arena, released by LMSYS Org, has become a globally recognized industry benchmark thanks to its innovative "arena" format and the rigor of its testing team.

Google DeepMind's chief scientist Jeff Dean has cited LMSYS Chatbot Arena ranking data to substantiate the performance of Bard. Andrej Karpathy, a founding member of OpenAI, has praised it: "Chatbot Arena is awesome."

LMSYS Org, the organization behind the Chatbot Arena ranking, is an open research group founded by students and faculty from the University of California, Berkeley, in collaboration with the University of California, San Diego, and Carnegie Mellon University.

Dr. Huang Wenhao, head of model training at 01.AI, summarized that LMSYS's evaluation mechanism is built on real user conversations that change dynamically and at random. No one can predict the distribution of questions, so a model cannot be optimized for any single capability, which strengthens objectivity. And because the ratings come from users, the results sit closer to user preferences in real-world applications.

Although its core members come from academia, LMSYS's research is closely tied to industry. Besides developing large language models itself, it supplies the industry with datasets (such as MT-Bench, the authoritative evaluation set for instruction following), evaluation tools, and distributed systems that accelerate large model training and inference, and it provides the compute behind the live online model competitions.

Chatbot Arena draws on the cross-comparison evaluation approach of the search engine era. It first pairs the "contestant" models submitted for evaluation at random and presents them to users anonymously. It then invites real users to enter their own prompts and, without knowing the models' names, judge the two models' responses.

On the blind test platform at https://arena.lmsys.org/, large models are compared head-to-head in pairs. The user enters a question, Model A and Model B each generate a real response, and the user votes by choosing one of four options beneath the results: Model A is better, Model B is better, it is a tie, or both are bad. After submitting, the user can move on to the next matchup.

By crowdsourcing real users for live online blind testing and anonymous voting, Chatbot Arena not only reduces the influence of bias but also largely closes off the possibility of gaming the rankings against a known test set, increasing the objectivity of the final scores. After cleaning and anonymizing the data, Chatbot Arena makes all user voting data public.
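As a rough illustration of what a single vote might look like as data, here is a minimal Python sketch; the Battle class, its fields, and the example values are hypothetical, modeled on the voting flow described above rather than on LMSYS's actual published schema.

```python
from dataclasses import dataclass

@dataclass
class Battle:
    """One anonymous pairwise comparison (hypothetical schema)."""
    model_a: str   # identity hidden from the voter until after the vote
    model_b: str
    prompt: str    # the user's own question, anonymized before release
    winner: str    # "model_a" | "model_b" | "tie" | "tie (bothbad)"

# Example vote: the user judged the two anonymous answers a tie.
vote = Battle(
    model_a="yi-large",
    model_b="gpt-4o",
    prompt="Explain the Elo rating system in one paragraph.",
    winner="tie",
)
```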

After collecting real user votes, LMSYS Chatbot Arena uses the Elo rating system to quantify model performance, further refining the scoring mechanism to keep the rankings objective and fair.

The Elo rating system is a statistically grounded rating method created by the Hungarian-American physicist Dr. Arpad Elo to quantify players' strength in competitive games. It is widely used in chess, Go, football, basketball, esports, and other competitive fields.

In the Elo system, each participant starts from a baseline rating that is adjusted after every match. The system uses the ratings to compute each participant's probability of winning: if a lower-rated player defeats a higher-rated one, the lower-rated player gains more points, and vice versa, as the sketch below illustrates.
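To make the mechanics concrete, here is a minimal Python sketch of the classic Elo update described above, applied to Arena-style votes; the K-factor of 32 and the 1200/1280 ratings are illustrative choices, not parameters given in the article, and LMSYS's production scoring adds further refinements.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one battle.
    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for either tie outcome."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * (e_a - score_a)

# An upset: the lower-rated model wins and gains more points than it
# would for beating an equal opponent.
print(elo_update(1200.0, 1280.0, score_a=1.0))  # ~ (1219.6, 1260.4)
```

Note that the update is zero-sum: whatever rating the winner gains, the loser gives up.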

Conclusion: Latecomers Have Advantages, and Chinese Teams Out-Execute the U.S. on Products

As large models move into commercial use, their real-world performance urgently needs rigorous testing in concrete application scenarios, and the whole industry is searching for a more objective, fair, and authoritative evaluation system. Major model makers are actively joining platforms like Chatbot Arena to prove their products' competitiveness through real user feedback and professional evaluation mechanisms.

Kai-Fu Lee believes the U.S. excels at groundbreaking scientific research and has a cohort of highly creative scientists, but the intelligence, diligence, and hard work of Chinese teams should not be underestimated. 01.AI's narrowing of a seven-to-ten-year gap to roughly six months shows that excelling at model building is not just about publishing papers, inventing new things, or being first.

"The best performers are the strongest," in his view, there are advantages to being a latecomer, and the U.S.'s creativity is worth learning from. "But in terms of execution, creating a great user experience, developing products, and business models, I believe we are stronger than American companies."

01.AI's enterprise-facing models initially target overseas users, because the team believes foreign users are far more willing and able to pay than domestic ones. The domestic To B market is still stuck in the lose-money-on-every-deal pattern that prevailed in the early AI 1.0 era, and the 01.AI team does not want to operate that way.

"Today, looking at the model performance, we surpass other models, and we welcome competitors who disagree to challenge us at LMSYS to prove us wrong. But until that day comes, we will continue to assert that we have the best model," said Kai-Fu Lee.