
Yuntian Lifei's Yu Xiaotian: Dissecting the Evolution and Challenges of Large Model Technology; Algorithm Chipification Breaks Through the "Triangle Constraint" of Large Model Implementation | GenAICon 2024

Wed, May 29 2024 07:53 AM EST

Author: GenAICon 2024

The 2024 China Generative AI Conference was held in Beijing on April 18-19. At the main venue on the first day, Yu Xiaotian, technical lead of Yuntian Lifei's "Cloud Sky Book" large model, delivered a speech titled "Exploration of the Evolution and Practical Applications of Multi-Modal Large Model Technology."

At the end of 2022, ChatGPT emerged, sparking a development frenzy in the AI industry. In early 2024, the large video model Sora was introduced, propelling the development of AGI (Artificial General Intelligence) into the fast lane. Yu Xiaotian illustrated the astonishing pace and potential of AI technology with the release of Sora and with cases such as a paralyzed patient in the U.S. regaining autonomous control through a brain-computer interface. He believes large model technology has moved from its nascent stage in previous years to its current peak, marking humanity's entry into a new era of AI and rapid progress toward AGI.

In the new era of flourishing AI development, large model technology has become a focal point in the AI field. Large models centered around the Transformer architecture are considered efficient and scalable learners, capable of quickly learning and compressing vast amounts of data. However, the development of large model technology still faces challenges, with insufficient data support being a prominent issue.

How can this challenge be overcome? Yu Xiaotian believes that the key lies in cultivating top AI talents, as top talents and experts are the cornerstone supporting the rapid development of large model technology.

As a significant development direction of large model technology, multi-modal large models have also garnered widespread attention. Yu Xiaotian noted that the information compression strategies of multi-modal large models mainly fall into two types: the hierarchical alignment structure and the end-to-end alignment structure. The former leverages the broad coverage of textual data to accelerate learning convergence, while the latter achieves efficient information compression by processing all modalities of information jointly. The practical application of multi-modal large models, however, still faces numerous challenges.

Against this backdrop, how will Yuntian Lifei break the "triangle constraint" of large model applications and open new possibilities for applying large model technology across industries?

The following is an excerpt from Yu Xiaotian's speech:

In reviewing recent significant events in large model technology, such as model releases and increases in computing power, I have observed two crucial pieces of information: first, the astonishing iterative speed of AI technology, with tech giants worldwide vying for leadership positions; second, the AI field centered on large model technology is undergoing unprecedentedly rapid development, and the pace continues to accelerate.

Let me share three illustrative examples.

Firstly, at Tesla's Investor Day last year, they showcased a video demonstrating a humanoid robot attempting to assemble machinery. This hints at the possibility of us entering an era where robots manufacture robots.

Secondly, OpenAI recently released Sora and has also collaborated with Figure to develop humanoid intelligent robots. These robots are highly interactive, can communicate smoothly with humans, and can execute human commands.

Additionally, last month the U.S. saw its first case of a paralyzed patient using a brain-computer interface to tweet with their thoughts and even to play games in the middle of the night. These astounding applications demonstrate the immense potential of AI technology and indicate that humanity has entered a new era of AI.

From ChatGPT to the Transformer: the evolution of large models in information compression and learning

The foundation of ChatGPT is the evolution of the Transformer structure. But what is the Transformer? We see it as an efficient and scalable learner over massive data. In simple terms, it is an information compression mechanism: it can compress the whole of recorded human knowledge in a short time and discover the patterns of language within it.
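To make the compression framing concrete, the following minimal sketch (not from the talk; the toy corpus and unigram model are stand-ins for a real corpus and a real language model) shows that a better predictive model yields a shorter Shannon code length for the same data.

```python
import math
from collections import Counter

# Toy illustration of "prediction is compression": a model that predicts text
# better also encodes it in fewer bits (Shannon code length = -log2 p).
corpus = "the cat sat on the mat the cat ate the rat"

# Baseline: a uniform code over the characters that actually occur.
alphabet = set(corpus)
uniform_bits = len(corpus) * math.log2(len(alphabet))

# "Learned" model: unigram character frequencies estimated from the data.
counts = Counter(corpus)
total = len(corpus)
model_bits = sum(-math.log2(counts[c] / total) for c in corpus)

print(f"uniform coding: {uniform_bits:.1f} bits")
print(f"unigram model : {model_bits:.1f} bits")
# A large language model plays the same role at scale: the better its
# next-token predictions, the fewer bits are needed to encode the data.
```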

The GPT structure does not favor any particular domain or modality; it can compress many kinds of knowledge and multiple modalities. The key conditions for this information compression are a massive parameter scale, powerful computing capability, and extensive data. Parameter counts have grown to the billions and even trillions, and on the computing side NVIDIA has continuously provided strong support.

From the data perspective, however, some scholars argue that data may eventually be insufficient to support the training of large models. What then? One answer may be data synthesis: using large models themselves to generate more data and learn from it in a game-like, self-play fashion.
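The talk leaves data synthesis at the level of an idea; the sketch below shows one common shape such a pipeline can take, a generate-then-filter loop, with `draft_answers` and `score` as hypothetical stand-ins for a generator model and a judge model.

```python
import random

# Hypothetical stand-ins for a generator LLM and a judge/reward model.
def draft_answers(prompt: str, n: int) -> list[str]:
    return [f"{prompt} -> candidate answer #{i}" for i in range(n)]

def score(prompt: str, answer: str) -> float:
    return random.random()  # a real judge model would rate quality here

def synthesize(prompts: list[str], n_candidates: int = 4, threshold: float = 0.7) -> list[dict]:
    """Generate-then-filter: keep only candidates the judge rates highly and
    feed the survivors back into the training set as synthetic data."""
    dataset = []
    for prompt in prompts:
        for answer in draft_answers(prompt, n_candidates):
            if score(prompt, answer) >= threshold:
                dataset.append({"prompt": prompt, "answer": answer})
    return dataset

print(synthesize(["Explain why the sky appears blue."]))
```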

We believe that one core foundation of large models is talent, top AI talent. Such talent can organically combine large parameter counts, computing power, and data into efficient information compression under the right algorithmic structure. These top experts are the cornerstone supporting the rapid development of large model technology in the United States.

The development of large model technology and its capabilities can be summarized as compressing massive amounts of information and learning the statistical regularities within it. In the text domain, we can already compress vast amounts of data to extract patterns, which yields language understanding and generation. Video, images, sound, and other modalities can likewise be compressed through large-scale data processing. By training on millions of hours of video data, a model can eventually perceive and understand the world the way we do with our eyes, and in the future even support interaction across modalities. This brings us naturally to the next topic: multimodal large models.

How do multimodal large-scale models compress information? We believe there are two main types.

The first type is the hierarchical (staged) alignment structure. Here, information is compressed in stages: the first stage compresses textual information, and subsequent stages compress other modalities such as visual and auditory data. Why do it this way? Because text data has the broadest coverage, is comprehensive, and is dense in knowledge, so aligning to it first yields faster learning convergence. An analogy is human learning: our three most important channels are speech, what we see with our eyes, and what we hear with our ears; these inputs guide learning first, and expression follows, which is the essence of staged alignment. As the diagram showed, the LLM backbone sits at the core of language alignment, and the other modalities are compressed and aligned to it stage by stage to find the patterns in the information.
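The talk does not give implementation details, but a staged (hierarchical) alignment typically has the shape of the PyTorch sketch below, in which the dimensions and module names are illustrative placeholders: a language backbone is trained (and then frozen) first, and a lightweight projector later aligns visual features into its token space.

```python
import torch
import torch.nn as nn

class StagedMultimodalModel(nn.Module):
    """Sketch of stage-wise alignment: the language backbone is trained first and
    kept frozen; a small projector then maps vision features into its token space."""
    def __init__(self, d_vision=512, d_model=768, vocab=32000):
        super().__init__()
        # Stand-in for a pretrained LLM backbone (frozen during alignment).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.embed = nn.Embedding(vocab, d_model)
        self.lm_head = nn.Linear(d_model, vocab)
        # The only part trained in the alignment stage: vision -> LLM embedding space.
        self.projector = nn.Linear(d_vision, d_model)
        for module in (self.backbone, self.embed, self.lm_head):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, image_feats, text_ids):
        img_tokens = self.projector(image_feats)   # (B, N_img, d_model)
        txt_tokens = self.embed(text_ids)          # (B, N_txt, d_model)
        h = self.backbone(torch.cat([img_tokens, txt_tokens], dim=1))
        return self.lm_head(h)                     # next-token logits

model = StagedMultimodalModel()
logits = model(torch.randn(1, 16, 512), torch.randint(0, 32000, (1, 32)))
print(logits.shape)  # torch.Size([1, 48, 32000])
```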

The second major type is the end-to-end alignment structure. Here, different modalities of data are learned simultaneously: images, text, and other inputs are fed in together, all of the information is processed and compressed jointly, and the patterns extracted from it ultimately yield an understanding of the world and drive interaction with it.
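For contrast with the staged sketch above, an end-to-end alignment can be sketched as follows (again with illustrative sizes, not the actual model): every modality is mapped into one shared token sequence and all parameters are trained jointly, with no frozen text-only stage.

```python
import torch
import torch.nn as nn

class EndToEndMultimodalModel(nn.Module):
    """Sketch of end-to-end alignment: image patches and text tokens enter one
    shared sequence, and every parameter is trained jointly from the start."""
    def __init__(self, d_patch=768, d_model=768, vocab=32000):
        super().__init__()
        self.patch_proj = nn.Linear(d_patch, d_model)   # image patches -> tokens
        self.text_embed = nn.Embedding(vocab, d_model)  # text ids -> tokens
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, patches, text_ids):
        seq = torch.cat([self.patch_proj(patches), self.text_embed(text_ids)], dim=1)
        return self.lm_head(self.trunk(seq))  # one trunk compresses both modalities

model = EndToEndMultimodalModel()
out = model(torch.randn(2, 64, 768), torch.randint(0, 32000, (2, 32)))
print(out.shape)  # torch.Size([2, 96, 32000])
```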

III. Interpreting the three stages of large model technology development; the scenario-feedback stage still faces challenges

What can large models do for us? Clearly, large model technology is just a tool. From a technical perspective, it helps us compress a great deal of information and identify the patterns within it more efficiently. For large models to demonstrate value, they must be implemented in real industries and business applications.

Drawing on the development path of AI technology, we define the development of large model technology in three stages. These stages essentially reflect the relationship among technology, scenarios, and data. At the outset, when designing algorithms, we typically validate them on a small amount of data; this is the stage of technology seeking scenarios. In the second stage, scenarios feed back into the technology: we use far more data to strengthen the capabilities of the algorithms and the technology. In the third stage, all applications and needs can be addressed by the same algorithm or model, which marks our entry into the era of Artificial General Intelligence (AGI).

So where does technology seeking scenarios stand today? In developing large model technology, we have already walked this path: many applications, such as intelligent question answering, text generation, and single-point applications of generative large models like ChatGPT, have demonstrated the maturity of both the applications and the algorithms of large model technology.

Currently, we are in the second stage, where scenarios feed back into technology. The implementation of multimodal large models still has a long way to go, and the complexity of industry scenarios is a significant challenge. We aspire to apply large models widely across industries, but the depth of knowledge those scenarios demand is a stern test of model capability: standards are inconsistent, and current capabilities still fall far short of demand.

Hence, there is a need to actively promote the implementation of multimodal large model technologies and address challenges to find solutions.

IV. How to break the "triangle constraint" of large model applications? Yuntian Lifei proposes "algorithm chipification."

What are the key variables to consider? From urban governance in smart cities to intelligent transportation, our conclusion is that implementing multimodal large models requires attention to a "triangle constraint" among three variables. Take today's much-discussed conversational systems: their accuracy is approaching human level, yet the limitations of large models still keep them from providing deep domain support and industry value. Many tasks in real production environments are complex, and the lack of domain-specific expertise together with the increasing complexity of data optimization makes cost and efficiency hard problems for large models.

Therefore, we need to strike a balance among accuracy, cost, and efficiency to drive the practical application of multimodal large models in conversational systems. We are actively addressing this issue and working with colleagues across the AI field to keep advancing the technology.

How has Yuntian Lifei broken through the "triangle constraint"? Let me share our solution.

Since its establishment in 2014, Yuntian Lifei has pursued a technology path it calls "algorithm chipification." Algorithm chipification is not simply running algorithms on chips; it requires highly specialized talent: experts who deeply understand algorithms, have professional knowledge of different scenarios and industries, and can co-design algorithms with those scenarios, with the results ultimately reflected on the chip and operator side, including extensible instruction sets, optimized computing architectures, and toolchain optimization.

This technical foundation lets us deploy a wide range of algorithms, including Transformers and various deep learning frameworks. Crucially, the resulting cost and efficiency are what make the practical application of multimodal large models feasible.

The multimodal large models Yuntian Lifei has developed cover several dimensions, including language, computer vision, text question answering, object detection, and segmentation. Their implementation follows a strategy of hierarchical decoupling: on top of the algorithm chipification platform, we built a general-purpose large model. This general model has solid foundational capabilities; it scores only around 60 to 70 on industry knowledge and scenario experience, but excels in generality, scoring 80, 90, or even full marks.

Going further, industry-specific and scenario-specific large models require operator-level optimization to reach high scores in specific business scenarios. This calls for low-cost algorithm optimization and efficient iterative training on edge-side data to meet customer needs.

Over the past decade, Yuntian Lifei's algorithm research has gone through long-term iteration. From work on ResNet-style convolutional neural networks before 2017 to the emergence of the Transformer structure, we were among the first to adapt Transformer structures across the entire algorithm chipification platform. After the company went public last year, we increased investment in large model research, continued to track advanced technologies at home and abroad, and successfully developed language and multimodal large models ranging from tens of billions to hundreds of billions of parameters.

Last month, we released Yuntian Lifei's 3.5V large model. It performs remarkably well on tasks such as image-text comprehension, generation, and question answering. In the field of language large models, our models repeatedly ranked first on authoritative leaderboards last year.

V. How did Yuntian Lifei achieve these results? Behind them are four key technologies.

How did we achieve these remarkable results? Despite facing numerous challenges, we have identified four key points worth sharing:

First, addressing cost issues. While accuracy can be improved through data accumulation, inference costs are unavoidable in practical applications. Our core focus is on solving the problem of efficient inference engines.

To this end, we independently developed the Space inference engine, which integrates efficiently with the operator layer, achieves lossless inference, and increases inference speed by over 50%. For example, generative large models normally predict one character forward at a time. We found a way to predict multiple characters at once while keeping accuracy lossless and unchanged: by improving the algorithm structure, we predict several entries simultaneously, thereby raising inference efficiency.
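The Space engine itself is not described in detail, so the toy sketch below only illustrates the general draft-and-verify idea behind predicting several tokens per step without changing the output; `target_next` and `draft_next` are trivial stand-ins for the full model and a cheap predictor.

```python
def target_next(token: int) -> int:
    """Stand-in for the full model's (expensive) next-token rule."""
    return (token * 31 + 7) % 101

def draft_next(token: int) -> int:
    """Stand-in for a cheap draft predictor that is right most of the time."""
    return target_next(token) if token % 5 else (token + 1) % 101

def generate(start: int, steps: int, k: int = 4) -> list[int]:
    """Draft k tokens ahead, then verify them against the target model and keep
    the longest matching prefix; the result matches plain one-by-one decoding."""
    out = [start]
    while len(out) <= steps:
        drafts, tok = [], out[-1]
        for _ in range(k):                 # draft phase: k cheap guesses
            tok = draft_next(tok)
            drafts.append(tok)
        tok = out[-1]
        for guess in drafts:               # verify phase: one batched pass in a real engine
            truth = target_next(tok)
            if guess != truth:
                out.append(truth)          # correct the first mismatch and stop
                break
            out.append(guess)
            tok = guess
    return out[: steps + 1]

def plain_generate(start: int, steps: int) -> list[int]:
    out = [start]
    for _ in range(steps):
        out.append(target_next(out[-1]))
    return out

assert generate(3, 20) == plain_generate(3, 20)  # lossless: same output, fewer steps
```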

Second, reducing core resource costs. We strive to improve efficiency and reduce GPU memory requirements. Through research on distributed chunking, including adaptive sparse-cache decoding techniques, we cut GPU requirements by 50%.
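"Distributed chunking" and "adaptive sparse cache decoding" are only named, not specified, so the sketch below shows just the general idea of a sparse KV cache under an assumed eviction policy: stay within a fixed memory budget by dropping the least-attended entries while protecting the most recent ones.

```python
from collections import OrderedDict

class SparseKVCache:
    """Toy sparse key/value cache: when the memory budget is exceeded, evict the
    entry with the lowest accumulated attention score, keeping the newest ones."""
    def __init__(self, budget: int = 4, protect_recent: int = 2):
        self.budget = budget
        self.protect_recent = protect_recent
        self.entries = OrderedDict()  # position -> (kv, accumulated attention score)

    def add(self, position: int, kv, score: float = 0.0) -> None:
        self.entries[position] = (kv, score)
        if len(self.entries) > self.budget:
            # Never evict the most recent `protect_recent` positions.
            candidates = list(self.entries.items())[: -self.protect_recent]
            victim = min(candidates, key=lambda item: item[1][1])[0]
            del self.entries[victim]

    def update_score(self, position: int, attention: float) -> None:
        kv, score = self.entries[position]
        self.entries[position] = (kv, score + attention)

cache = SparseKVCache(budget=4)
for pos in range(6):
    cache.add(pos, kv=f"kv_{pos}", score=0.1 * pos)
print(list(cache.entries))  # low-scoring early positions have been evicted
```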

Third, optimizing training. Training optimization is the foundation on which all practical applications of large models are built. We developed a scalable large model training technique. Put simply: when training a larger model, can the expanded parameters or scale reuse what has already been trained, and be optimized without additional cost?

The answer is yes, and the method also saves training cost. By reusing trained parameters to expand the model in both depth and width, training efficiency is doubled while training cost is reduced by up to 50%.
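The mechanics are not spelled out in the talk; the sketch below illustrates, in the spirit of function-preserving growth methods such as Net2Net rather than Yuntian Lifei's specific technique, the simplest form of both moves: depth growth by initializing new layers from trained ones, and width growth by copying trained weights into a larger layer.

```python
import copy
import torch
import torch.nn as nn

def grow_depth(layers: nn.ModuleList, extra: int) -> nn.ModuleList:
    """Depth expansion: new layers start as copies of already-trained ones,
    so the grown model begins near the small model instead of from scratch."""
    grown = list(layers)
    for i in range(extra):
        grown.append(copy.deepcopy(layers[i % len(layers)]))
    return nn.ModuleList(grown)

def grow_width(layer: nn.Linear, new_out: int) -> nn.Linear:
    """Width expansion: copy trained rows into a wider layer; only the new
    rows need fresh initialization, so prior training is reused."""
    wider = nn.Linear(layer.in_features, new_out)
    with torch.no_grad():
        wider.weight[: layer.out_features] = layer.weight
        wider.bias[: layer.out_features] = layer.bias
    return wider

small = nn.ModuleList([nn.Linear(64, 64) for _ in range(4)])
deep = grow_depth(small, extra=4)           # 4 layers -> 8 layers
wide = grow_width(nn.Linear(64, 64), 128)   # 64 outputs -> 128 outputs
print(len(deep), wide.weight.shape)         # 8 torch.Size([128, 64])
```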

Fourth, neural network processors and inference chips, which have been our focus for the past decade. Through four generations of iteration, from the first-generation NNP100 to the current NNP400T, we have fully adapted to a variety of deep learning architectures. For the Transformer architecture in particular, we optimized the instruction set and co-designed the operators to support efficient Transformer inference. We were also among the first companies to use Chiplet structures to adapt large models.

With these four core technologies, we have built an algorithm-chipification system that supports large models at the edge. Its underlying technology rests on neural network processors and self-developed inference chips, which advances home-grown production, avoids supply-chain dependence, and enables multimodal large models to run. On the application side, we offer large models tailored to industry and edge scenarios. More importantly, we let users perform imperceptible online fine-tuning at extremely low cost while protecting their data privacy.

VI. Achieving efficient inference at 30 words per second; multimodal large models deployed on the government (G) side

Yuntian Lifei's multimodal large models excel at text comprehension and generation, achieving an inference speed of 30 words per second and handling contexts of over 450,000 words. Given specified requirements, they can rapidly generate notices, resolutions, and other documents in prescribed formats, effectively driving office automation. The entire generation process is concise and fast.

Furthermore, we support modifying and refining an article against reference content, efficiently polishing existing text to meet specific needs. This capability has already been deployed in multiple prefecture-level cities and provincial departments, and using our multimodal large models to empower office work, for example generating project reports, is highly flexible.

Lastly, in text comprehension and generation, the quality of the generated content is crucial. We provide a built-in proofreading function that allows the generated content to be optimized repeatedly, achieving a self-iterating, self-evolving effect.

Yuntian Lifei's multimodal large models also support the understanding and generation of video data. Once large amounts of data have been produced, some of it still needs optimization and editing, especially in consumer scenarios such as image editing and 3D data synthesis.

Through our multimodal large models, we can synthesize data to obtain the 3D data we need. For image understanding, instructions can be used to render and edit an entire image: the large model understands and manipulates images based on instructions and can even produce different styles. The agent capabilities of multimodal large models, such as open object detection, also support urban development. As mentioned earlier, we have released an AI model box aimed at promoting the application of AI technology in cities, including support based on multimodal large models.

We are honored to be in this era of thriving AI development, leading various industries through continuous transformation. Today, AI large model technology is flourishing across various industries, and we hope to work hand in hand with experts and friends from all walks of life to jointly lead the implementation of multimodal technology and move towards the direction of AGI.

The above is the complete compilation of Yu Xiaotian's speech content.