
Domestic AdaPipe Technology for Optimizing Thousand-Card Cluster Training Released

Zhu Hanbin | Sat, May 11, 2024, 10:39 AM EST

Recently, at ASPLOS, a leading international conference on computer architecture held in San Diego, USA, AdaPipe, a domestically developed technology for optimizing training on thousand-card clusters, was officially released. It was independently developed by the Intelligent Computing Research Department of Peng Cheng Laboratory together with Professor Chen Wenguang's team from the Department of Computer Science at Tsinghua University.

In recent years, large language models have demonstrated outstanding performance in applications such as dialogue, question answering, and text summarization, attracting widespread attention from academia and industry. However, as these models evolve toward more parameters and longer input texts, they place ever higher demands on the memory and compute capabilities of computing devices.

Currently, traditional pipeline-parallel training methods exhibit imbalances in memory usage and computational load when handling models with hundreds of billions or even trillions of parameters, which directly hurts resource utilization and overall training efficiency. The problem is even more pronounced on existing domestically produced computing cards, whose high-speed memory capacity and communication capabilities are limited.

To address these challenges, Chen Wenguang's team developed AdaPipe. The technology adapts the recomputation strategy by refining the granularity of recomputation according to the specific model and hardware parameters, and, taking into account the differences in computational load across pipeline stages, it further jointly optimizes the recomputation and pipeline-partitioning strategies. As a result, AdaPipe not only makes full use of memory resources but also keeps the computational load evenly distributed across computing nodes, significantly improving training efficiency.
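To make the idea concrete, the following is a minimal, illustrative Python sketch of this kind of joint optimization: partition a model's layers into contiguous pipeline stages and greedily choose which activations to recompute so that each stage fits within a memory budget while the slowest stage remains as fast as possible. The cost model, layer counts, and helper names are hypothetical assumptions for illustration only, not the released AdaPipe implementation.

"""
Illustrative sketch (not the authors' code) of jointly choosing
(1) which activations to recompute and (2) where to cut pipeline stages,
so every stage fits in device memory and the bottleneck stage is minimized.
All numbers and helper names are hypothetical.
"""
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Layer:
    fwd_time: float   # forward compute time (arbitrary units)
    bwd_time: float   # backward compute time
    act_mem: float    # activation memory if kept (GB)

def stage_cost(layers, mem_cap):
    """Cost of one pipeline stage: compute time plus recompute overhead.

    Greedy rule (a simplification of finer-grained recomputation):
    discard the cheapest-to-recompute activations first until the stage
    fits under mem_cap. Returns (time, feasible).
    """
    time = sum(l.fwd_time + l.bwd_time for l in layers)
    mem = sum(l.act_mem for l in layers)
    # Candidates sorted by recompute time paid per GB of memory saved.
    for l in sorted(layers, key=lambda l: l.fwd_time / l.act_mem):
        if mem <= mem_cap:
            break
        mem -= l.act_mem          # activation discarded...
        time += l.fwd_time        # ...and recomputed during the backward pass
    return time, mem <= mem_cap

def best_partition(layers, n_stages, mem_cap):
    """Try all contiguous cuts; keep the one minimizing the slowest stage."""
    n = len(layers)
    best = None
    for cuts in combinations(range(1, n), n_stages - 1):
        bounds = (0,) + cuts + (n,)
        stages = [layers[a:b] for a, b in zip(bounds, bounds[1:])]
        costs = [stage_cost(s, mem_cap) for s in stages]
        if not all(ok for _, ok in costs):
            continue              # some stage cannot fit even with recompute
        bottleneck = max(t for t, _ in costs)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, bounds)
    return best

if __name__ == "__main__":
    # A toy 8-layer model with deliberately uneven layers.
    model = [Layer(fwd_time=1.0 + 0.3 * i, bwd_time=2.0 + 0.6 * i,
                   act_mem=4.0 + i) for i in range(8)]
    result = best_partition(model, n_stages=4, mem_cap=20.0)
    print("bottleneck stage time:", result[0], "cut points:", result[1])

In practice, the published technique works at a finer recomputation granularity and with measured model and hardware parameters; the brute-force enumeration above is only meant to show how the recomputation and stage-partitioning decisions interact.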

Research indicates that AdaPipe supports mainstream acceleration cards such as GPUs and NPUs. When applied to training models such as Llama-2 and GPT-3 on the domestically produced thousand-card cluster of "Peng Cheng Cloud Brain II," it delivered a performance improvement of more than 20%. Furthermore, in the actual training of the 200-billion-parameter general-purpose large model in the "Peng Cheng · Brain" project, using a 4K context window on 3,456 cards, AdaPipe achieved an efficiency improvement of more than 10%. These cases will provide technical groundwork and practical reference for optimizing training on future domestically produced ten-thousand-card clusters.

The development of this technology was supported and funded by the National Natural Science Foundation of China and Peng Cheng Laboratory.

Related paper information: Link to Paper