On March 14, Cerebras Systems unveiled its third-generation wafer-scale AI accelerator chip, the WSE-3 (Wafer Scale Engine 3), with even more staggering specs. The company says it has doubled performance with no increase in power consumption or cost.
The first-generation WSE-1, launched in 2019, was built on TSMC's 16nm process with a massive die area of 46,225 square millimeters. It packed 1.2 trillion transistors and 400,000 AI cores, carried 18GB of on-chip SRAM, delivered 9PB/s of memory bandwidth and 100Pb/s of interconnect bandwidth, all while drawing 15 kilowatts of power.
In 2021, the second-gen WSE-2 moved to TSMC's 7nm process. While the die stayed at the same 46,225 square millimeters, the transistor count jumped to 2.6 trillion. The core count grew to 850,000 and the SRAM expanded to 40GB. Memory bandwidth hit a staggering 20PB/s, and interconnect bandwidth reached 220Pb/s.
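Using only the figures quoted above, it's easy to see what each process shrink bought in transistor density (a quick back-of-envelope calculation, not a Cerebras-published number):

```python
# Transistor density implied by the WSE-1 and WSE-2 figures above.
AREA_MM2 = 46_225  # die area in mm^2, identical across generations

transistors = {"WSE-1 (16nm)": 1.2e12, "WSE-2 (7nm)": 2.6e12}

for name, count in transistors.items():
    density = count / AREA_MM2  # transistors per square millimeter
    print(f"{name}: {density / 1e6:.1f}M transistors/mm^2")

# WSE-1 comes out to roughly 26M/mm^2 and WSE-2 to roughly 56M/mm^2,
# about a 2.2x density gain from the 16nm -> 7nm transition.
```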
The latest third-gen WSE-3 has been upgraded to TSMC's 5nm process. The exact die size wasn't mentioned, but it's reasonable to assume it remains similar, since the chip already consumes an entire wafer, which limits how much larger it can get.
The transistor count has soared to an astonishing 4 trillion, and the number of AI cores has increased to 900,000. The on-chip SRAM has reached 44GB, with options for external memory capacities of 1.5TB, 12TB, or 1,200TB.
At first glance, the increase in core count and cache capacity might seem modest, but the performance leap is significant. Peak AI compute has reached 125PFlops, that is, 125 quadrillion floating-point operations per second, rivaling the top supercomputers.
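Spelling out that headline number, and dividing it across the 900,000 cores, gives a feel for the scale (the per-core figure is my own derived estimate, not a published spec):

```python
# 125 PFlops expressed in plain floating-point operations per second.
PETA = 10**15
peak_flops = 125 * PETA           # 1.25e17 ops/s, i.e. 125 quadrillion
print(f"{peak_flops:.3e} FLOPS")  # prints 1.250e+17

# Naive per-core throughput if the peak were spread evenly over all cores
# (an illustrative estimate only).
per_core = peak_flops / 900_000   # roughly 1.4e11, i.e. ~139 GFLOPS per AI core
print(f"~{per_core / 1e9:.0f} GFLOPS per core")
```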
It's capable of training next-gen AI mega-models in the class of GPT-4 or Gemini, and at several times their parameter counts, up to 24 trillion parameters, all within a single logical memory space with no partitioning or refactoring required.
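As a rough sanity check on the 24-trillion-parameter figure against the largest external memory option, here's a back-of-envelope estimate. The 2-bytes-per-parameter (fp16/bf16) assumption is mine, not from the announcement:

```python
# Back-of-envelope: raw weight storage for a 24-trillion-parameter model,
# assuming 2 bytes per parameter (fp16/bf16) -- an assumption, not a Cerebras figure.
params = 24e12
bytes_per_param = 2
weights_tb = params * bytes_per_param / 1e12  # convert bytes to terabytes
print(f"~{weights_tb:.0f} TB of weights")     # ~48 TB of raw weights

# The 1,200TB external memory option leaves ample headroom beyond the weights
# themselves (e.g. for optimizer state and activations).
```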
Training a mega-model with 1 trillion parameters on it is about as easy as training a 10-billion-parameter model on GPUs.
With four systems running in parallel, it can fine-tune a 70-billion-parameter model in a single day. It also supports clusters of up to 2,048 interconnected systems, enough to train Llama's 70 billion parameters from scratch within a day.
Specific power consumption and pricing for the WSE-3 haven't been disclosed, but judging by the previous generation, the price should be around $2 million or more.