
First Large-Scale Speech Recognition Model Supporting 30 Freely Mixed Dialects Released

ZhaoAnLi Tue, May 28 2024 11:00 AM EST

"Could you please check the phone bill?" "Can you help me check the phone bill?" China Telecom's Wanhao Intelligent Customer Service receives millions of calls every day, many of them in dialect. Elderly callers in particular are often accustomed to communicating only in dialect, and both intelligent and human customer service representatives struggle to understand them, which significantly reduces communication efficiency.

Recently, the TeleAI Institute of China Telecom unveiled the industry's first large-scale speech recognition model that supports 30 dialects freely mixed in speech. Named the "Xingchen Super Multidialect Speech Recognition Model," it removes the limitation that a single model can recognize only one specific dialect. The model can simultaneously recognize and understand over 30 dialects, including Cantonese, Shanghainese, Sichuanese, and Wenzhounese, making it the speech recognition model that supports the most dialects in China.

Statistics show that approximately 20% of China's population is still not proficient in Mandarin. These people are often cut off from intelligent information services and find it hard to enjoy the convenience of the AI era. Building a high-quality dialect database is fundamental to dialect preservation and research. The TeleAI Institute of China Telecom has so far established a high-quality dialect database covering over 30 dialects and more than 300,000 hours of recordings, placing it at the forefront of the industry in the richness and quality of its dialect data.

Committed to independent research and innovation, the TeleAI Institute of China Telecom has introduced the Xingchen Speech Large-Scale Model. Through massive-scale speech pre-training and multi-dialect joint modeling, it is the first single model to recognize 30 dialects freely mixed in speech. Among speech recognition models in China, it covers the most dialects and serves the largest population.

According to Li Xuelong, Chief Technology Officer of China Telecom and Director of the TeleAI Institute, the research team pioneered a "distillation + expansion" joint training algorithm that addresses the collapse of pre-training under massive multi-scenario datasets and large parameter counts, achieving stable training of an 80-layer model with 1 billion parameters. The Xingchen Speech Large-Scale Model is also the industry's first open-source speech recognition model based on discrete speech representations. Its modeling paradigm, which goes from speech to tokens and then to text, reduces the bit rate of speech transmission during inference by a factor of several tens.
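
To put the bit-rate claim in concrete terms, the following Python sketch compares a conventional audio stream with a stream of discrete speech tokens. Every figure in it is an illustrative assumption, not a published specification of the Xingchen model: the 16 kbit/s baseline stream, the rate of 50 tokens per second, and the 1,024-entry codebook are hypothetical, as are the function names.

```python
def token_bitrate(tokens_per_second: int, codebook_size: int) -> int:
    """Bits per second needed to send discrete tokens as fixed-width indices."""
    # Number of bits required to index any entry in the codebook.
    bits_per_token = (codebook_size - 1).bit_length()
    return tokens_per_second * bits_per_token


def reduction_factor(baseline_bps: float, token_bps: float) -> float:
    """How many times smaller the token stream is than the baseline stream."""
    return baseline_bps / token_bps


# Hypothetical values chosen only for illustration (not from the article).
BASELINE_BPS = 16_000      # assumed bit rate of a compressed speech stream
TOKENS_PER_SECOND = 50     # assumed token rate of the speech tokenizer
CODEBOOK_SIZE = 1024       # assumed number of distinct speech tokens

tokens_bps = token_bitrate(TOKENS_PER_SECOND, CODEBOOK_SIZE)
print(f"token stream: {tokens_bps} bit/s")
print(f"reduction: {reduction_factor(BASELINE_BPS, tokens_bps):.0f}x")
```

Under these assumed numbers the token stream needs 500 bit/s, a 32-fold reduction. A different baseline codec or token rate would give a different ratio.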

AI is injecting new vitality into the inheritance of language and culture. Once dialect data has been collected, recorded, and summarized, it is crucial to ensure that future generations can understand it correctly. China's territory is vast and its dialects diverse, yet traditional dialect research relies on the subjective perception and annotation of field investigators, which requires significant human effort and makes systematic annotation difficult. AI, by contrast, can organize and summarize dialects efficiently and systematically, which is of great importance for dialect preservation and inheritance.

The Xingchen Speech Large-Scale Model is already in wide use. It has been piloted in China Telecom's Wanhao Intelligent Customer Service in Fujian, Jiangxi, Guangxi, Beijing, Inner Mongolia, and other regions. With the Xingchen model integrated, the Wanhao Intelligent Customer Service can instantly understand 30 dialects and handles approximately 2 million calls daily. The Yisheng intelligent customer service platform has integrated the Xingchen model's speech understanding and analysis capabilities, covering all 31 provinces and handling 1.25 million customer service calls daily. The Xingchen Speech Large-Scale Model has also been deployed on the 12345 platforms of various cities.