
NVIDIA's GPU is so yesterday! World's No.1 AI chip upgraded with 4 trillion transistors, 900,000 cores

Shang Fang Wen Q Tue, Mar 19 2024 08:58 AM EST

On March 14, Cerebras Systems unveiled its third-generation wafer-scale AI accelerator chip, the WSE-3 (Wafer Scale Engine 3), with even more insane specs: performance has doubled without any increase in power consumption or cost.

The first-generation WSE-1, launched in 2019, was built on TSMC's 16nm process and measured a massive 46,225 square millimeters. It packed 1.2 trillion transistors and 400,000 AI cores alongside 18GB of SRAM cache, delivered 9PB/s of memory bandwidth and 100Pb/s of interconnect bandwidth, all while guzzling 15 kilowatts of power.

In 2021, the second-generation WSE-2 moved to TSMC's 7nm process. The die size stayed at 46,225 square millimeters, but the transistor count skyrocketed to 2.6 trillion, the core count rose to 850,000, and the cache grew to 40GB. Memory bandwidth hit a staggering 20PB/s, and interconnect bandwidth reached 220Pb/s.

The latest third-generation WSE-3 has been upgraded to TSMC's 5nm process. The exact die size wasn't mentioned, but it's reasonable to assume it stays about the same, since each chip already takes up an entire wafer and can't get much larger.

The transistor count has soared to an astonishing 4 trillion, and the number of AI cores has grown to 900,000. On-chip cache has reached 44GB, and the chip can be paired with external memory of 1.5TB, 12TB, or 1,200TB.

At first glance, the increases in core count and cache capacity look modest, but the performance leap is significant. Peak AI compute reaches 125 PFlops, roughly 125 quadrillion (1.25 × 10^17) floating-point operations per second, rivaling top supercomputers. Cerebras says it can train next-generation mega-models in the class of GPT-4 or Gemini, and even models several times larger, up to 24 trillion parameters, all held in a single logical memory space with no partitioning or refactoring required.
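To put those figures in perspective, here is a quick Python sanity check using my own assumptions (decimal units, 2 bytes per parameter for FP16/BF16 weights), not anything published by Cerebras:

```python
# Quick unit and capacity sanity checks (my arithmetic, not Cerebras figures).

PFLOPS = 1e15                  # 1 petaflop/s = 10^15 FLOP/s
peak = 125 * PFLOPS            # peak AI compute quoted for the WSE-3
print(f"peak compute: {peak:.2e} FLOP/s")        # -> 1.25e+17 FLOP/s

# Assumption: 2 bytes per parameter (FP16/BF16 weights only, ignoring
# optimizer state), to compare model size against the 1,200 TB
# external-memory option.
BYTES_PER_PARAM = 2
TB = 1e12                      # decimal terabyte

for n_params in (1e12, 24e12):
    weights_tb = n_params * BYTES_PER_PARAM / TB
    print(f"{n_params / 1e12:.0f}T params -> {weights_tb:.0f} TB of raw weights")

# Prints roughly:
#   peak compute: 1.25e+17 FLOP/s
#   1T params -> 2 TB of raw weights
#   24T params -> 48 TB of raw weights
```

Under those assumptions, even a 24-trillion-parameter model's raw weights come to only about 48 TB, a small fraction of the 1,200 TB memory tier, which is consistent with the claim of holding the whole model in one logical memory space.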

Cerebras says that training a mega-model with 1 trillion parameters on it feels like training a 10-billion-parameter model on GPUs.

Four systems working in parallel can fine-tune a 70-billion-parameter model in a single day, and clusters of up to 2,048 systems are supported, enough to train a Llama 70B-class model from scratch within a day.
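Whether 2,048 systems at the quoted peak can really push a Llama-70B-class model through training in a day is easy to sanity-check with the common ~6 x N x D estimate of total training FLOPs. The sketch below assumes 2 trillion training tokens (the figure reported for Llama 2 70B); these are my assumptions, not Cerebras numbers:

```python
# Back-of-envelope plausibility check (my assumptions, not Cerebras figures):
# can 2,048 systems at the quoted peak train a Llama-70B-class model in a day?

SECONDS_PER_DAY = 86_400
PEAK_PER_SYSTEM = 125e15        # FLOP/s, peak AI compute quoted for the WSE-3
NUM_SYSTEMS = 2_048

# Assumptions: the common ~6 * N * D estimate of total training FLOPs,
# a 70-billion-parameter model, and 2 trillion training tokens.
params = 70e9
tokens = 2e12
train_flops = 6 * params * tokens                # ~8.4e23 FLOPs

cluster_peak = NUM_SYSTEMS * PEAK_PER_SYSTEM     # ~2.56e20 FLOP/s
flops_per_day = cluster_peak * SECONDS_PER_DAY   # ~2.2e25 FLOPs at peak

print(f"needed     : {train_flops:.2e} FLOPs")
print(f"one-day cap: {flops_per_day:.2e} FLOPs at peak")
print(f"implied utilization: {train_flops / flops_per_day:.1%}")
# -> about 3.8%, so the one-day claim is compute-plausible even at a
#    small fraction of peak throughput.
```

On paper, then, the one-day figure would require only a few percent of the cluster's aggregate peak throughput.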

Specific power consumption and pricing for WSE-3 haven't been disclosed, but based on the previous generation, it should be around $2 million or more.
