
Who gives you the confidence to use CPUs for AI inference?

Jin Lei, Meng Chen | Wed, Mar 27, 2024, 08:45 AM EST

During the training phase of large models, GPUs are the natural choice. But when it comes to inference, more and more teams are decisively bringing CPUs into the mix.

That, at least, is the view QuantBit has heard repeatedly in recent conversations with industry practitioners. Coincidentally, Hugging Face's official optimization tutorials also include several articles on how to run large-model inference efficiently on CPUs. A closer look at those tutorials shows that this CPU-accelerated approach covers not only large language models but also multimodal models for images and audio.

Beyond that, mainstream frameworks and libraries such as TensorFlow and PyTorch have kept shipping CPU-specific optimizations and efficient inference paths of their own.
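As a rough illustration of what that looks like in practice (a minimal sketch rather than a transcription of any particular Hugging Face tutorial), the snippet below runs a causal language model entirely on the CPU with plain Transformers; the checkpoint name and the BF16 setting are assumptions made for the example.

```python
# Minimal CPU-only text generation with Hugging Face Transformers.
# The checkpoint is illustrative; any causal LM from the Hub works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # BF16 weights let AMX-capable Xeon CPUs accelerate the matmuls
)
model.eval()

inputs = tokenizer("Why run inference on CPUs?", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

No device-placement code is needed: if no GPU is requested, the model simply stays on the CPU.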

So while GPUs and other dedicated accelerators dominate the world of AI training, CPUs have carved out a path of their own in inference, including large-model inference, and the discussion around them has been gaining surprising momentum.

This is closely tied to how large models themselves are evolving. Since ChatGPT set off the AIGC craze, players at home and abroad focused first on training, producing a crowded field of competing foundation models. But as training wrapped up, the major models quickly shifted toward the application stage.

Even NVIDIA, in its latest quarterly financial report, indicated that AI inference now accounts for 40% of its $18 billion data center revenue. This illustrates how inference is gradually becoming the main theme in the deployment process of large models.

Now, why choose CPUs for inference? To answer that, let's work backwards from the results and see how the players who have already put CPUs to work for AI inference are faring.

Let's welcome two heavyweight contenders: JD Cloud and Intel.

This year, JD Cloud launched a new generation of servers featuring the fifth-generation Intel Xeon Scalable processors. First, let's take a look at the CPUs powering this new server.

If this latest generation of Intel Xeon Scalable processors had to be described in one sentence, it would be that it has an increasingly pronounced AI flavor. Compared with the previous (fourth) generation, which carries the same built-in AI acceleration technology, AMX (Advanced Matrix Extensions), it delivers a 42% improvement in real-time deep learning inference performance. And compared with the earlier third-generation Xeon Scalable processors, whose built-in accelerator was DL Boost (Deep Learning Boost), AI inference performance has increased by as much as 14 times.

Now, let's delve into the two phases that Intel Xeon Scalable's built-in AI accelerator has undergone:

The first phase focused on optimizing vector operations. It began with the Advanced Vector Extensions 512 (Intel AVX-512) instruction set, introduced in the first-generation Xeon Scalable processors in 2017, which lets a single CPU instruction operate on multiple data elements at once. The second and third generations then added Vector Neural Network Instructions (VNNI), the core of DL Boost, which fuse the three separate instructions of a multiply-accumulate into one. This raises the utilization of compute resources, makes better use of the high-speed caches, and avoids potential bandwidth bottlenecks.

The second phase, which is where things stand today, focuses on optimizing matrix operations. Starting with the 4th-generation Intel Xeon Scalable processors, the spotlight shifted to Intel Advanced Matrix Extensions (Intel AMX) as the built-in AI acceleration technology. AMX is tailored to the matrix multiplications that dominate deep learning models and supports the common data types BF16 (for training and inference) and INT8 (for inference).
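At the framework level, this hardware path is typically reached through Intel Extension for PyTorch (IPEX). The sketch below is a hedged example of preparing a model for BF16 inference so that eligible matrix multiplications can be dispatched to AMX-friendly kernels; the checkpoint name is a placeholder and the exact ipex.optimize behavior may vary between versions.

```python
# A sketch of BF16 inference via Intel Extension for PyTorch (IPEX) on an
# AMX-capable Xeon CPU. Names and arguments are illustrative assumptions.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

# Apply IPEX operator optimizations and weight prepacking for BF16
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Summarize this document:", return_tensors="pt")
with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```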

Intel AMX consists of two main components: dedicated Tile registers that hold large blocks of data, and TMUL acceleration engines that execute matrix multiplication on them. Some liken it to a Tensor Core built into the CPU, which is quite apt.

With this setup, AMX can not only compute larger matrices in a single operation, it is also designed with scalability and extensibility in mind.

Intel AMX sits in every core of the Xeon CPU, close to the system memory. This shortens data-transfer latency, raises effective data-transfer bandwidth, and reduces complexity in practical use.
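Before counting on AMX, it is worth confirming that the host CPU and kernel actually expose it. The Linux-only sketch below (an illustrative check, not an Intel tool) reads the CPU feature flags from /proc/cpuinfo.

```python
# Check whether the CPU advertises the AMX feature flags (Linux only).
def amx_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                return {name: name in flags
                        for name in ("amx_tile", "amx_bf16", "amx_int8")}
    return {}

print(amx_flags())  # e.g. {'amx_tile': True, 'amx_bf16': True, 'amx_int8': True}
```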

For instance, when a model with no more than 20 billion parameters is fed to a 5th-generation Intel Xeon Scalable processor, latency drops below 100 milliseconds.

Next, let's take a look at the new generation of JD Cloud servers.

According to reports, JD has worked with Intel to customize and optimize inference performance (token generation speed) for Llama2-13B on the 5th Gen Intel® Xeon® Scalable processors, achieving a 51% improvement, enough to cover AI scenarios such as question answering, customer service, and document summarization.

△ Llama2-13B inference performance test data
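For readers who want to gauge this kind of number on their own hardware, a rough way to measure token generation speed with plain Transformers is sketched below. This is not JD Cloud's or Intel's benchmark harness; the checkpoint, prompt, and token count are placeholders.

```python
# Rough tokens-per-second measurement for CPU text generation.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

inputs = tokenizer("Write a short summary of the patient's visit.", return_tensors="pt")

with torch.no_grad():
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    elapsed = time.perf_counter() - start

generated = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tokens/s")
```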

Even larger models such as the 70B-parameter Llama2 remain within reach of the 5th-generation Intel Xeon Scalable processors. This shows that the AI accelerators built into CPUs have evolved to the point where they can reliably meet real-world inference demands.

An AI acceleration solution built on general-purpose servers like this does not only serve inference; it can also flexibly handle data analysis, machine learning, and other workloads. To put it a little boldly, a single server can act as an end-to-end platform for AI applications.

Beyond cost-effectiveness, running AI inference on CPUs has an even more important advantage: ease of deployment. CPUs are standard components; nearly every server and computer already has them, and traditional business environments are full of CPU-based applications. Choosing CPUs for inference is therefore low-friction: there is no need to introduce a heterogeneous hardware platform or build up specialized talent, and technical support and maintenance are easier to obtain.

Take healthcare as an example. CPUs have long been used in electronic medical record systems and hospital resource planning systems, so mature technical teams and well-established procurement processes are already in place. Building on that foundation, leading healthcare IT company Weining Health has used CPUs to build an efficient, low-cost deployment solution for WiNEX Copilot, which is deeply integrated into Weining's next-generation WiNEX products. Any hospital already running the system can quickly bring this "doctor's AI assistant" online. Its medical record assistant feature alone can process nearly 6,000 medical records in 8 hours, equivalent to a full day's work by 12 doctors at a top-tier hospital.

And as mentioned earlier, judging by Hugging Face's optimization tutorials, getting efficient inference running on CPUs takes only a few simple steps. That simplicity and speed of optimization give CPUs a real edge when it comes to actually landing AI applications.

This means that in any scenario, large or small, once CPU optimization succeeds in one place, it can be replicated and scaled quickly and precisely. The result: more users can deploy AI applications in similar or related scenarios faster and more cost-effectively.

After all, Intel is not just a hardware company; it also has a sizable software team, and it accumulated a wealth of optimization methods and tools during the traditional deep learning era, such as the OpenVINO toolkit that is widely used in industries like manufacturing and retail.
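For reference, the core OpenVINO workflow of compiling a model for the CPU device is only a few lines; the sketch below assumes a recent openvino package, a local model file ("model.xml" is a placeholder), and a static input shape.

```python
# Compile a model for the CPU device with OpenVINO and run one inference.
import numpy as np
import openvino as ov

core = ov.Core()
compiled = core.compile_model("model.xml", device_name="CPU")  # placeholder model path

# Feed random data matching the model's (static) input shape
input_shape = list(compiled.input(0).shape)
data = np.random.rand(*input_shape).astype(np.float32)

result = compiled(data)[compiled.output(0)]
print(result.shape)
```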

In the era of large models, Intel has also built deep collaborations around mainstream models such as Llama 2, Baichuan, and Qwen. Its Intel Extension for Transformers toolkit, for example, can accelerate large-model inference performance by up to 40 times.
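Its usage roughly mirrors the familiar Transformers API. The hedged sketch below shows weight-only INT4 loading in the style of the project's documentation; the checkpoint name is a placeholder and exact arguments may differ between releases.

```python
# Weight-only INT4 loading with Intel Extension for Transformers (illustrative).
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_id = "Intel/neural-chat-7b-v3-1"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Why run LLM inference on CPUs?", return_tensors="pt").input_ids

# load_in_4bit triggers weight-only quantization tuned for CPU inference
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```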

Furthermore, the clear trend in large models nowadays is towards more application-specific optimizations. Thus, enabling a constant stream of new applications to be deployed efficiently and economically becomes paramount.

Hence, it's not difficult to understand why more and more people are opting for CPUs for AI inference. Perhaps we can reinforce this point by quoting Intel CEO Pat Gelsinger's remarks during a media interview at the end of 2023:

"From an economic standpoint for inference applications, I'm not going to build a backend environment entirely with H100s costing forty thousand dollars each because they consume too much power and require building new management and security models, along with new IT infrastructure."

"If I can run these models on standard Intel chips, then these problems won't arise."

Looking back at 2023, large models were undoubtedly the absolute focus of the AI community. However, as we enter 2024, a noticeable trend is the acceleration of various technological advancements and the progress of applications across industries, leading to a situation where progress is being made on multiple fronts simultaneously.

In this scenario, it's foreseeable that there will be an emergence of more AI inference demands, and the proportion of inference compute power in the overall AI compute demand will only increase. For example, AI video generation represented by Sora is speculated to require significantly less training compute power compared to large models, but the inference compute power demand is hundreds or thousands of times greater. Additionally, optimizations such as video transmission required for the deployment of AI video applications are also a strong suit for CPUs.

Therefore, taking everything into account, the positioning of CPUs within Intel's AI Everywhere vision becomes clear: to complement areas not covered or inadequately covered by GPUs or dedicated accelerators, providing flexible compute choices for a wider range of diverse and complex scenarios. While strengthening general-purpose computing, CPUs become essential infrastructure for the widespread adoption of AI.