
Jensen Huang Unleashes a Megaton: 2700W GPUs and a 240TB AI Supercomputer

Shang Fang Wen Q Mon, Mar 25 2024 09:10 AM EST

March 19 — Early this morning, Jensen Huang officially unveiled the next-generation Blackwell GPU architecture, along with its B100/B200 GPU chips, GB200 superchip, and DGX supercomputer, taking the "tactical nuke" to a whole new level and dominating the global stage.

[Image: Blackwell B200 GPU]

The Blackwell B200 GPU features a chiplet architecture for the first time, comprising two B100 chiplets. The fifth-generation NVLink 5 bus provides up to 1.8TB/s of GPU-to-GPU bandwidth, and up to 576 B200s can be interconnected.

The B100 chiplet is fabricated on a custom-designed TSMC 4NP process (an enhancement of the 4N process used for the H100 and RTX 40 series), which has reached the limits of double-patterning lithography. The two chiplets are linked by a 10TB/s die-to-die interconnect, allowing them to operate as a single unified B200 GPU.

The B100 chiplet integrates a massive 104 billion transistors, a 30% increase over the 80 billion transistors in the previous-generation H100. The B200 GPU as a whole features a staggering 208 billion transistors.

The die size of the B100 chiplet has not been disclosed yet, but given the process limitations, it is unlikely to be significantly larger than the 814mm² H100.

The number of CUDA cores in the B100 chiplet has also not been disclosed, but it is expected to be significantly higher than the 16,896 cores in the H100. Some speculate that it could exceed 20,000 cores.

Each B100 chiplet is paired with four 24GB HBM3E memory stacks with an effective data rate of 8Gbps on a 4096-bit bus, providing 4TB/s of bandwidth per chiplet.
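
For anyone who wants to see how that bandwidth figure falls out, here is a quick back-of-the-envelope check (a minimal sketch in Python; the 4096-bit width and the 8Gbps effective per-pin rate are simply the numbers quoted above):

    # Per-chiplet HBM3E bandwidth: bus width (bits) x per-pin data rate (Gbps) / 8
    bus_width_bits = 4096   # four HBM3E stacks per chiplet
    data_rate_gbps = 8      # effective 8Gbps per pin, as quoted above
    bandwidth_gb_s = bus_width_bits * data_rate_gbps / 8
    print(f"{bandwidth_gb_s / 1000:.1f} TB/s per chiplet")  # -> 4.1 TB/s, i.e. ~4TB/s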

Consequently, the B200 GPU offers a total of 192GB of HBM3E memory on a combined 8192-bit bus with a total bandwidth of 8TB/s, representing increases of 1.4×, 60%, and 1.4× over the H100, respectively.

Performance

The B200 adds support for the FP4 tensor data format, delivering 9 PFlops (9 quadrillion floating-point operations per second). INT/FP8, FP16, and TF32 tensor performance reaches 4.5, 2.25, and 1.1 PFlops, improvements of 1.2×, 1.3×, and 1.3× over the H100, respectively. FP64 tensor performance, however, has decreased by roughly 40% (such workloads rely on the GB200 instead). FP32 and FP64 vector performance figures have not been disclosed.
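
The capacity, bus-width, and bandwidth increases quoted above can be double-checked against NVIDIA's published H100 SXM figures (80 billion transistors, 80GB of HBM3 on a 5120-bit bus at 3.35TB/s); a quick sketch:

    # B200-vs-H100 ratios, using this article's B200 figures and NVIDIA's
    # published H100 SXM specs (80B transistors, 80GB HBM3, 5120-bit, 3.35TB/s).
    h100 = {"transistors_b": 80, "mem_gb": 80, "bus_bits": 5120, "bw_tb_s": 3.35}
    b200 = {"transistors_b": 208, "mem_gb": 192, "bus_bits": 8192, "bw_tb_s": 8.0}

    for key in h100:
        ratio = b200[key] / h100[key]
        print(f"{key}: {ratio:.2f}x the H100 (+{ratio - 1:.0%})")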

Blackwell Architecture

The Blackwell GPU also features a second-generation Transformer Engine that supports new micro-tensor scaling. This, coupled with advanced dynamic-range management algorithms in TensorRT-LLM and the NeMo Megatron framework, enables a doubling of compute power and model size with the new 4-bit floating-point AI inference capability.
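
NVIDIA has not published the details of the format, but the core idea behind micro-tensor (block-wise) scaling is that each small block of values carries its own scale factor, letting a 4-bit representation retain usable dynamic range. Below is a purely illustrative sketch; the 32-element block size and the symmetric integer-style quantizer are assumptions, not NVIDIA's actual FP4 encoding.

    import numpy as np

    def quantize_blockwise_4bit(x, block=32):
        """Illustrative block-wise 4-bit quantization: one scale per block.

        Mimics the idea of per-block (micro-tensor) scaling; this is NOT
        NVIDIA's actual FP4 encoding, which has not been published here.
        Assumes len(x) is divisible by the block size.
        """
        x = x.reshape(-1, block)
        scale = np.abs(x).max(axis=1, keepdims=True) / 7.0  # int4-style range -7..7
        scale[scale == 0] = 1.0                             # avoid division by zero
        q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q * scale

    weights = np.random.randn(4, 64).astype(np.float32)
    q, s = quantize_blockwise_4bit(weights.ravel())
    restored = dequantize(q, s).reshape(weights.shape)
    print("max abs error:", np.abs(weights - restored).max())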

Other Features

Additional features include a dedicated RAS (Reliability, Availability, and Serviceability) engine, Secure AI, and a decompression engine.

Power Consumption

The B100 maintains the same 700W power consumption as the previous-generation H100, while the B200 introduces a new 1000W tier.

Claims and Applications

NVIDIA claims that the Blackwell GPU is capable of AI training and real-time large language model inference on models with up to 10 trillion parameters.

The GB200 Grace Blackwell is the next-generation superchip in the Grace Hopper line, upgrading from one GPU and one CPU to two GPUs and one CPU. The GPU portion is the B200, while the CPU remains Grace. They are connected via a 900GB/s ultra-low-power NVLink-C2C chip-to-chip interconnect.

For large language model inference workloads, the GB200 superchip delivers a performance boost of up to 30x compared to the H100.

However, this comes at a cost: the GB200 consumes up to 2700W of power. Air cooling can be used, but liquid cooling is highly recommended.

NVIDIA DGX SuperPOD: The AI Supercomputer Unleashing Massive Models

Powered by the groundbreaking GB200 superchip, NVIDIA's latest AI supercomputer, the DGX SuperPOD, packs a mammoth 36 of these chips aboard, for a total of 36 Grace CPUs and 72 B200 GPUs, seamlessly connected via NVLink 5. The system's colossal fast-memory capacity reaches up to 240TB.
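
Those totals follow straight from the superchip's composition described above (one Grace CPU plus two B200 GPUs per GB200); a trivial sketch:

    # Aggregate CPU/GPU counts for a 36-superchip DGX SuperPOD, given that
    # each GB200 superchip pairs 1 Grace CPU with 2 B200 GPUs.
    CPUS_PER_SUPERCHIP = 1
    GPUS_PER_SUPERCHIP = 2
    superchips = 36

    print("Grace CPUs:", superchips * CPUS_PER_SUPERCHIP)  # -> 36
    print("B200 GPUs: ", superchips * GPUS_PER_SUPERCHIP)  # -> 72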

This AI behemoth tackles gargantuan models with trillions of parameters, ensuring uninterrupted training and inference for large-scale generative AI workloads. Its FP4 precision delivers an impressive 11.5 EFlops (11.5 quintillion floating-point operations per second).

The DGX SuperPOD's scalability is exceptional: it can be expanded to tens of thousands of GB200 superchips through Quantum-X800 InfiniBand networking, and with BlueField-3 DPUs incorporated, each GPU gets a whopping 1.8TB/s of bandwidth.

The fourth-generation Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology boosts in-network computing capability to 14.4 TFlops, a fourfold increase over its predecessor.

NVIDIA also unveiled the DGX B200, its sixth-generation universal AI supercomputing platform. It pairs dual fifth-generation Intel Xeon processors with eight B200 GPUs carrying 1.4TB of HBM3E and 64TB/s of aggregate memory bandwidth. Delivering 144 PFlops (144 quadrillion floating-point operations per second) at FP4 precision, the DGX B200 enables real-time inference for trillion-parameter models with a 15x speed boost.
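
That 64TB/s aggregate figure lines up with the per-GPU memory bandwidth quoted earlier in this article; a trivial check, assuming all eight GPUs run at the full 8TB/s:

    # Sanity check on the DGX B200's aggregate HBM3E bandwidth: eight B200s,
    # each at the 8TB/s per-GPU bandwidth quoted in the memory section above.
    gpus = 8
    per_gpu_tb_s = 8
    print(gpus * per_gpu_tb_s, "TB/s aggregate")  # -> 64 TB/s, matching the spec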

The DGX B200 integrates eight NVIDIA ConnectX-7 NICs and two BlueField-3 DPUs for high-performance networking, offering a whopping 400Gb/s of bandwidth per connection. It can be scaled out for even greater AI performance via the Quantum-2 InfiniBand and Spectrum-X Ethernet networking platforms.

Blackwell GPU-powered offerings are set to hit the market later this year, with adoption from Amazon Web Services, Dell, Google, Meta, Microsoft, OpenAI, Oracle, Tesla, xAI, and more.

Amazon Web Services, Google Cloud, Microsoft Azure, and Oracle Cloud will be among the first cloud providers to offer Blackwell GPU-powered instances, with NVIDIA Cloud Partner Network members Applied Digital, CoreWeave, Crusoe, IBM Cloud, and Lambda also offering such services.

Sovereign AI cloud providers, including Indosat Ooredoo Hutchison, Nebius, Nexgen Cloud, Oracle Cloud Sovereign EU, Oracle Cloud Government US/UK/AU, Scaleway, Singtel, Taiga Cloud by Northern Data Group, Yotta Data Services' Shakti Cloud, and YTL Power International, will also provide cloud services and infrastructure based on the Blackwell architecture.
