
Member Zhou Yuan: Shortage of High-Quality Chinese Language Data Hampers AI Development

Zhao Anli, Bian Ge Sun | Mar 10, 2024, 02:48 PM EST

During the 2024 National People's Congress and Chinese People's Political Consultative Conference (NPC & CPPCC), Member Zhou Yuan, founder and CEO of Zhihu, pointed out a significant challenge hindering the development of artificial intelligence (AI) in China: the shortage of high-quality Chinese language data.

While China has been closely following the international frontier in large-scale AI models, it still faces several challenges, the most notable being the scarcity of high-quality Chinese language resources.

[Photo: Interview with Member Zhou Yuan on addressing the shortage of high-quality Chinese language corpus]

Statistics indicate that by the end of 2023, more than 200 companies and academic institutions in China were engaged in the research and development of large-scale models with over 1 billion parameters, and more than 20 large-scale model products had been approved to provide services to the public. However, Zhou Yuan believes that the shortage of high-quality Chinese language corpus resources has, to a certain extent, constrained the development of artificial intelligence technology and the promotion of innovative applications in China.

"Within the training data for ChatGPT, Chinese content accounts for less than one-thousandth of the total, while English content accounts for over 92.6%," Zhou Yuan stated. Despite China's abundant data resources, high-quality Chinese data remains scarce owing to factors such as insufficient data mining and restrictions on its free circulation in the market.

Zhou Yuan pointed out that due to the shortage of high-quality Chinese language corpus resources, many research institutions and enterprises engaged in large-scale model development in China have to rely on foreign annotated datasets, open-source datasets, or web crawling for data during model training. Therefore, he emphasized that addressing the shortage of high-quality Chinese language corpus data is crucial for promoting the high-quality development of China's large-scale model industry.

"When we examine the iterative development of large-scale models, the shortage of Chinese text becomes even more apparent," Zhou Yuan said, likening high-quality Chinese language corpus to a "reservoir." Asked whether the growth of the large-scale model industry could itself ease the shortage of Chinese language corpus, he said the task is to "first build the reservoir, and then make reasonable use of it."

Furthermore, he explained that the largest corpus for large-scale models currently comes from the User-Generated Content (UGC) ecosystem, which includes the knowledge, experiences, and insights that individuals upload. "I believe the work of building the reservoir has not received enough attention. Instead, the focus has been on how to 'fetch water,' such as data crawling and content retrieval during model training, which can raise issues of intellectual property rights and privacy security."

"The cycle of having computational power and models without good data is obviously flawed," Zhou Yuan stated, emphasizing that the shortage of corpus data will continue to be a particularly evident and serious issue in the coming years, requiring sufficient attention.

In response, Zhou Yuan proposed addressing the shortage of high-quality Chinese language corpus data from three aspects: establishing data compliance supervision mechanisms, strengthening data security and intellectual property protection, and accelerating the development and utilization of high-quality Chinese datasets for the large-scale model industry in China.

Specifically, he suggested that the relevant authorities establish regulatory mechanisms for data compliance, advance legislation governing AI-generated content (AIGC), and protect and regulate data compliance in the field of artificial intelligence. Authorities should also research and formulate management measures or regulations that fully protect the intellectual property rights and interests of data holders, and should encourage and support enterprises and other social entities that hold rich, high-quality data reserves and have the capacity to keep producing such data. Finally, accelerating the development and utilization of high-quality Chinese datasets involves standardizing data annotation, exploring transaction models for data elements, and opening up and sharing public data resources, all of which require active participation from every sector of society.