
Elon Musk is spending billions to build the world's largest supercomputing center, using 100,000 H100 chips to train Grok to catch up with GPT-4.

Mon, May 27 2024 08:01 AM EST
New Wisdom Times Report

Editors: Joe Yang, Feeling Sleepy

New Wisdom Times Overview: Elon Musk, who had been quiet for a while, recently released big news: his AI startup xAI will invest heavily in building a supercomputing center to secure the training of Grok 2 and future versions. This "supercomputing factory" is expected to be completed in the fall of 2025, at a scale four times that of today's largest GPU clusters.

Recently, OpenAI, Google, and Microsoft have successively held conferences, intensifying the competition in the AI community.

In such a lively atmosphere, how could Musk be absent?

After a busy stretch with Tesla and Starlink, he seems to have freed up some time recently, and without making any noise beforehand, he went straight to the big reveal: he is going to build the world's largest supercomputing center.

In March of this year, xAI released Grok-1.5, its latest version, and rumors of an upcoming Grok 2 have been circulating ever since, with no official announcement so far. Could the holdup be insufficient computing power?

Yes, that's right. Even billionaires may not be able to buy enough chips. In April this year, Musk personally stepped in to say that a shortage of advanced chips had delayed the training and release of the Grok 2 model. He noted that training Grok 2 requires around 20,000 NVIDIA H100 GPUs based on the Hopper architecture, adding that Grok 3 and later versions will need 100,000 H100 chips.

Tesla's first-quarter financial report also revealed that the company had been constrained by limited computing power. Musk's initial plan was to deploy 85,000 H100 GPUs by the end of this year, with most of the $6 billion xAI raised from Sequoia Capital and other investors going to chip purchases.

Currently, each H100 is priced at approximately $30,000, putting the chips alone at roughly $2.6 billion, excluding construction costs and other server equipment.
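As a rough back-of-the-envelope check, the arithmetic looks like the sketch below. It uses only the per-chip price and GPU counts quoted above; volume discounts, servers, networking, and construction are all ignored.

```python
# Back-of-the-envelope chip cost, using only figures quoted in the article:
# ~$30,000 per H100, 85,000 GPUs in the initial plan, and 100,000 for the
# next Grok version. Everything beyond the bare chips is excluded.

H100_UNIT_PRICE = 30_000  # USD, approximate street price

for label, gpu_count in [("initial 2024 plan", 85_000),
                         ("Gigafactory of Compute", 100_000)]:
    cost_billion = gpu_count * H100_UNIT_PRICE / 1e9
    print(f"{label}: {gpu_count:,} GPUs -> ${cost_billion:.2f}B in chips alone")

# initial 2024 plan: 85,000 GPUs -> $2.55B in chips alone
# Gigafactory of Compute: 100,000 GPUs -> $3.00B in chips alone
```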

According to Musk's estimates, this chip inventory would be more than sufficient for training Grok 2.

However, after what seems to have been a month of deliberation, Musk decided this step was not big enough or groundbreaking. After all, xAI's goal is to compete head-on with formidable opponents like OpenAI and Google, and future model training cannot be allowed to lag for lack of computing power.

Therefore, he recently publicly stated that xAI would need to deploy 100,000 H100s to train and operate the next version of Grok.

Moreover, xAI plans to interconnect all chips into a massive computer - what Musk refers to as the "Gigafactory of Compute."

This month, Musk informed investors that he aims to have this supercomputer operational by the fall of 2025 and that he will personally ensure its timely delivery, as it is crucial for developing LLMs.

This supercomputer may be built jointly by xAI and Oracle. In recent years, xAI has rented servers containing approximately 16,000 H100 chips from Oracle, making it Oracle's largest source of H100 orders.

Without developing its own computing power, xAI may end up spending around $10 billion on cloud servers in the coming years, making the "Gigafactory of Compute" a more cost-effective solution.
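The rent-versus-build logic can be sketched with simple arithmetic. Only the ~$10 billion multi-year cloud figure comes from the report above; the hourly rental rate and the buildout multiplier below are purely illustrative assumptions.

```python
# Illustrative rent-vs-build comparison. Only the ~$10B multi-year cloud
# estimate comes from the article; the rental rate and the assumption
# that facility buildout roughly doubles chip cost are placeholders.

GPU_COUNT = 100_000
RENT_PER_GPU_HOUR = 2.50        # USD, assumed reserved H100 cloud rate
HOURS_PER_YEAR = 24 * 365

build_cost = 2 * GPU_COUNT * 30_000  # chips (~$3B) plus assumed buildout

for years in (1, 2, 3, 4):
    rent = GPU_COUNT * RENT_PER_GPU_HOUR * HOURS_PER_YEAR * years
    print(f"{years} yr rental: ${rent / 1e9:.1f}B  vs  build: ~${build_cost / 1e9:.0f}B")

# At these assumed rates, four years of rental (~$8.8B) approaches the
# article's $10B figure, while building comes in around $6B up front.
```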

Once completed, this "Gigafactory of Compute" will be at least four times the size of today's largest GPU clusters. For comparison, Meta announced in March two clusters of 24,576 H100 GPUs each for Llama 3 training. NVIDIA has said that the B100, based on the new Blackwell architecture, will go into production and ship in the second half of this year. Even so, Musk's current plan is to buy H100s.

Why buy an about-to-be-superseded model in bulk instead of the latest chip? The reason, in the words of NVIDIA CEO Jensen Huang himself, is that "time is crucial in today's AI competition."

Nevertheless, even if everything goes smoothly and the "supercomputing factory" Musk has personally committed to deliver arrives on time, it is uncertain whether the cluster will still hold a scale advantage by next fall.

Back in January, Zuckerberg posted on Instagram that Meta would deploy an additional 350,000 H100s by the end of this year, bringing its total computing power to the equivalent of 600,000 H100s, though he did not say how many chips any single cluster would contain. Within just half a year, that number nearly doubled: shortly before the release of Llama 3 in April, reports emerged that Meta had purchased another 500,000 GPUs from NVIDIA, bringing its total to 1 million GPUs, with a retail value of $30 billion.

Meanwhile, Microsoft aims to have 1.8 million GPUs by the end of the year, and OpenAI is even more ambitious, hoping to use 10 million GPUs for its latest AI models. The two companies are also in talks to build a $100 billion supercomputer containing millions of NVIDIA GPUs. Who will emerge victorious in this battle of computational power? Most likely NVIDIA.

And not just the H100: NVIDIA CFO Colette Kress has mentioned a priority-customer list for the flagship Blackwell chips that includes OpenAI, Amazon, Google, xAI, and others.

The upcoming B100, along with the chips NVIDIA now plans to refresh annually, will keep flowing into the tech giants' supercomputing centers to help them upgrade their computing power.

Chip shortages, insufficient power supply

When discussing Tesla's computing power issues, Musk also added that while chip shortages have been a major constraint on AI development so far, power supply will be crucial in the next one or two years, possibly even surpassing chips as the biggest limiting factor.

Factors such as power supply are crucial in choosing the location for this newly built "supercomputing factory." A data center with 100,000 GPUs may require a dedicated power supply of 100 megawatts.
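That 100-megawatt figure is easy to sanity-check. In the sketch below, the per-GPU board power and the facility overhead factor (PUE) are assumptions, not numbers from the article.

```python
# Sanity check on ~100 MW for a 100,000-GPU data center.
# 700 W is the approximate TDP of an SXM H100; the PUE (cooling and
# power-delivery overhead) is an assumed typical value.

GPU_COUNT = 100_000
H100_BOARD_POWER_W = 700
PUE = 1.4  # assumed facility overhead factor

gpu_load_mw = GPU_COUNT * H100_BOARD_POWER_W / 1e6
facility_mw = gpu_load_mw * PUE
print(f"GPU load: {gpu_load_mw:.0f} MW, facility total: ~{facility_mw:.0f} MW")
# GPU load: 70 MW, facility total: ~98 MW -- consistent with ~100 MW
```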

To provide that level of power, the San Francisco Bay Area, where xAI is headquartered, is clearly not an ideal choice. To cut costs, data centers are typically built in remote areas where power is cheaper and more plentiful.

For example, besides planning the $100 billion supercomputer, Microsoft and OpenAI are building a large data center in Wisconsin at a cost of around $10 billion; Amazon Web Services' data centers are located in Arizona.

A very likely location for the "supercomputing factory" is Tesla's headquarters in Austin, Texas.

Last year, Tesla announced that its Dojo supercomputer would be built there. Dojo is based on Tesla's custom chips and helps train the AI behind its autonomous-driving software; it can also provide cloud services to outside customers.

The first Dojo runs on 10,000 GPUs and cost around $300 million to build. Musk said in April that Tesla currently has a total of 35,000 GPUs training its autonomous-driving systems.

Training models in data centers is an extremely power-hungry process. Training GPT-3 is estimated to have consumed 1,287 megawatt-hours of electricity, roughly the annual consumption of 130 American households.
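The household comparison holds up under a standard assumption about US residential usage. The 1,287 MWh figure is from the estimate above; the per-household consumption below is an assumed average, roughly in line with EIA data.

```python
# Verifying the "roughly 130 US households" comparison.
# 1,287 MWh is the GPT-3 training estimate quoted in the article;
# annual household consumption is an assumed US average (~10,500 kWh).

GPT3_TRAINING_MWH = 1_287
HOUSEHOLD_KWH_PER_YEAR = 10_500  # assumed US average

households = GPT3_TRAINING_MWH * 1_000 / HOUSEHOLD_KWH_PER_YEAR
print(f"~{households:.0f} household-years of electricity")
# ~123 household-years -- in line with the article's "roughly 130"
```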

Musk is not the only CEO taking notice of AI's power problem. Sam Altman has personally invested $375 million in the startup Helion Energy, which aims to use nuclear fusion to power AI data centers in a greener and cheaper way.

Unlike Altman, Musk is not betting on fusion technology. He believes AI companies will soon be competing for step-down transformers, which convert high-voltage grid power into electricity that data centers can actually use: "a massive drop from utility-grid voltage (e.g., 300 kilovolts) down to below 1 volt."
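To make the scale of that voltage drop concrete, here is an illustrative step-down chain. Only the two endpoints (300 kV and under 1 V) come from Musk's quote; the intermediate stages are assumptions, and in practice the final steps are handled by on-board voltage regulators rather than transformers.

```python
# Illustrative voltage step-down chain from the utility grid to a chip.
# Endpoints (300 kV and <1 V) are from Musk's quote; the intermediate
# stages are assumed for illustration. Real facilities use several
# transformer stages, then on-board regulators for the last steps.

stages_v = [300_000, 35_000, 480, 12, 0.8]  # volts, assumed chain

for v_in, v_out in zip(stages_v, stages_v[1:]):
    print(f"{v_in:>9,.1f} V -> {v_out:>8,.1f} V  (ratio {v_in / v_out:,.0f}:1)")

print(f"overall step-down: {stages_v[0] / stages_v[-1]:,.0f}:1")
# overall step-down: 375,000:1
```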

After chips, the AI industry will need "transformers for Transformers."

References:

https://www.theinformation.com/articles/musk-plans-xai-supercomputer-dubbed-gigafactory-of-compute?rc=epv9gi

https://www.inc.com/ben-sherry/elon-musk-touts-nvidia-dominance-predicts-a-giant-leap-in-ai-power.html

https://finance.yahoo.com/news/jensen-huang-elon-musk-openai-182851783.html?guccounter=1