
Hinton Reveals Ilya's Growth Journey: Scaling Law Was His Intuition Since Student Days

Tue, May 28 2024 07:42 AM EST

On a Sunday in the summer of 2003, AI pioneer Geoffrey Hinton was coding in his office at the University of Toronto when he was interrupted by a rather brash knock on the door.

Standing outside was a young student who had spent the entire summer frying French fries at a fast-food job, but who was far more eager to join Hinton's lab.

Hinton asked, "Why didn't you make an appointment? We could have had a proper chat."

The student retorted, "How about we chat right now?"

This young student was Ilya Sutskever, then fresh out of his second year of undergraduate mathematics. When he had asked at the academic office, he was told, "If you want to study machine learning, you should talk to Professor Hinton."

He followed this advice, leading to a legendary journey:

From AlexNet to AlphaGo, he was involved in two groundbreaking research endeavors that changed the world.

At OpenAI's founding, he was recruited as Chief Scientist. Under his guidance, the lab released the early GPT models, the DALL·E series, the code-generation model Codex, and eventually ChatGPT, revolutionizing the world once again.

Years later, he sparked an upheaval on the board of directors, ultimately parting ways with OpenAI and leaving the world eagerly awaiting his next move.

During his time at OpenAI, Ilya was never as visible as Altman, nor did he share his "programming philosophy" online the way Brockman did.

In his few speeches and interviews, Ilya mostly discussed technology and big-picture thinking, rarely delving into his own experiences, and he has kept a low profile over the past six months.

This story comes from his doctoral advisor, Geoffrey Hinton.

In a recent conversation with Sana Labs founder Joel Hellermark, Hinton not only talked about his own experiences but also reminisced about their time working together as mentor and student.

More than 20 years have passed, yet many of the details remain vivid in Hinton's recollection.

The interview video quickly went viral, not only for the anecdotes but also for its look at some of Ilya's academic ideas and how they evolved:

  • In 2010, Ilya developed a language model using GPUs.
  • The Scaling Law initially stemmed from his intuition.
  • Both of them agree that "language models are not just about predicting the next token."
  • They both concur that "prediction is compression, and compression is intelligence."

So, what does Ilya look like in Hinton's eyes?

Astounding raw intuition

After Ilya joined the lab, the first task Hinton assigned to him was to read a paper on backpropagation.

At the next weekly meeting, Ilya came back and reported, "I don't understand."

Hinton was quite disappointed, thinking to himself, "This kid seems pretty sharp; how can he not grasp something as basic as the chain rule for derivatives?"

Ilya quickly explained, "Oh, I understand that part. What I don't get is why you don't just hand the gradients to a sensible function optimizer."

It took Hinton's team several years to address this issue, and it was Ilya, who had joined only a week earlier, who first pointed out the problem.

Situations like this kept happening, and Hinton came to see Ilya's raw intuition as extraordinary.
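To put Ilya's question in modern terms: why not hand the gradients computed by backpropagation to an off-the-shelf function optimizer? The sketch below is a minimal illustration of that idea, not anything from the lab's actual code: the toy data, the tiny network, and the choice of SciPy's L-BFGS routine are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Toy regression data (invented for this illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.sin(X.sum(axis=1, keepdims=True))

# Parameter vector layout: W1 (3x8), b1 (8), W2 (8x1), b2 (1).
def unpack(w):
    W1 = w[:24].reshape(3, 8)
    b1 = w[24:32]
    W2 = w[32:40].reshape(8, 1)
    b2 = w[40:41]
    return W1, b1, W2, b2

def loss_and_grad(w):
    W1, b1, W2, b2 = unpack(w)
    h = np.tanh(X @ W1 + b1)                # forward pass
    pred = h @ W2 + b2
    err = pred - y
    loss = 0.5 * np.mean(err ** 2)
    # Backpropagation: the chain rule, applied layer by layer.
    d_pred = err / len(X)
    gW2 = h.T @ d_pred
    gb2 = d_pred.sum(axis=0)
    d_h = (d_pred @ W2.T) * (1.0 - h ** 2)  # tanh'(x) = 1 - tanh(x)^2
    gW1 = X.T @ d_h
    gb1 = d_h.sum(axis=0)
    return loss, np.concatenate([gW1.ravel(), gb1, gW2.ravel(), gb2])

# Hand the gradients to a general-purpose optimizer instead of
# writing the update rule ourselves.
w0 = rng.normal(scale=0.1, size=41)
result = minimize(loss_and_grad, w0, jac=True, method="L-BFGS-B")
print(f"final loss: {result.fun:.6f}")
```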

But Hinton also admitted he could never quite figure out where that intuition came from; perhaps it was the product of an early interest in AI problems combined with an excellent mathematical foundation.

Beyond his research intuition, Ilya also demonstrated exceptional coding and engineering skills during his student years.

Back then, frameworks like TensorFlow or Torch didn't exist yet; most machine learning tooling lived in Matlab.

While working on a task that required endlessly tweaking matrix-multiplication code in Matlab, Ilya grew impatient and announced that he wanted to write a better interface for it.

Upon hearing this, Hinton earnestly advised him, "You shouldn't do that. It will take a month, and we shouldn't get distracted. Let's finish the current project first."

Ilya casually replied, "Oh, it's no big deal. I already finished it this morning."

This work later appeared in Ilya's doctoral dissertation.

A firm believer in the Scaling Law since his student days

As Hinton put it, Ilya had remarkable intuition on many issues. The Scaling Law, now embraced by much of the AI community, was already a firm belief of his during his student days, and he took every opportunity to recommend it to those around him.

Later, at the founding of OpenAI, his articulation of the idea grew even more refined.

In the early days, Hinton regarded this stance as a form of "dodging responsibility" for researchers lacking innovative ideas.

Hinton mentioned that at the time, no one dared predict that computers would become a billion times faster; the most anyone imagined was a 100-fold increase.

(Ilya joined Hinton's lab in 2003. It's unclear exactly when he began thinking about the Scaling Law, but the idea may have been circling in his mind for some 20 years.)

It wasn't until 2020, several months before the release of GPT-3, that the OpenAI team formally defined and introduced the concept in a paper.

Using GPUs for language models predates AlexNet

In late 2010, Ilya and another student, James Martens (now a research scientist at DeepMind), collaborated on a language model that was later accepted at ICML 2011.

They used an RNN architecture trained on Wikipedia data across 8 of the most advanced GPUs of the time, two years before AlexNet made GPU training famous.

Unlike today's large language models, which predict the next token, their model predicted one character at a time.
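For a feel of the setup, here is a minimal character-level language model sketch in PyTorch. One hedge up front: the 2011 paper actually used a multiplicative RNN trained with Hessian-free optimization, so the plain RNN, the toy text, and the hyperparameters below are illustrative stand-ins, not the original method.

```python
import torch
import torch.nn as nn

# Tiny corpus and character vocabulary (stand-ins; the real model
# trained on Wikipedia).
text = "the quick brown fox jumps over the lazy dog. " * 4
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.RNN(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x, h=None):
        z, h = self.rnn(self.embed(x), h)
        return self.head(z), h        # logits over the next character

model = CharRNN(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
ids = torch.tensor([stoi[c] for c in text]).unsqueeze(0)
x, targets = ids[:, :-1], ids[:, 1:]  # shift input by one character

for step in range(300):
    logits, _ = model(x)
    # cross_entropy expects (batch, classes, time)
    loss = nn.functional.cross_entropy(logits.transpose(1, 2), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.3f}")
```

The essential point survives the simplification: the model reads characters one at a time and is trained to assign high probability to whichever character comes next.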

The model had its limits. Given a starting text, it could continue generating sentences that resembled Wikipedia articles: the semantics were mostly gibberish, but the grammar and punctuation were largely correct, with quotation marks and parentheses appearing in pairs and subjects agreeing with verbs.

In an interview for the University of Toronto magazine, Ilya said this had exceeded everyone's expectations.

Rationally, Hinton couldn't believe the system "understood" anything, yet it behaved as though it did.

For instance, when given a list of places, it could continue generating place names, even though it couldn't tell countries from states.

At the time, Ilya was reluctant to speculate about the potential applications of the work.

Following their success on Wikipedia, the team then attempted to analyze New York Times articles, aiming to teach the system to recognize different authors based on their writing styles.

He did, however, acknowledge that if done well, the technology could one day become the foundation for plagiarism-detection software.

Today, the code for this paper still resides on servers at the University of Toronto, available to anyone interested in further research.

Not just predicting the next token

The stories that came next, AlexNet and the trio "auctioning" themselves to Google, are well known, so we will skip over them here.

After Ilya joined OpenAI, he no longer worked with Hinton, but their academic thinking remained on the same path.

After ChatGPT arrived, many critics argued that large models are just statistics: they predict the next token, like a parrot randomly mimicking human speech.

Both Hinton and Ilya, mentor and student, believe it is much more than that.

In Hinton's view, the next token after a question is the first token of the answer.

Therefore, learning to predict means one must learn to understand the question.

This kind of understanding is akin to human cognition, yet fundamentally different from old-fashioned autocomplete based on trigram statistics.

Ilya has promoted this view tirelessly, raising it in a conversation with NVIDIA CEO Jensen Huang last year and again in his last public interview before the recent turmoil at OpenAI.

In another interview, he went further.

This, Ilya believes, is why the "predict the next token" paradigm could lead to AGI, and perhaps even surpass humans on the way to ASI.

Prediction is compression, compression is intelligence

When discussing "predicting the next token," Ilya often mentions "compression," as he believes prediction is compression, and compression is the essence of intelligence.

However, Ilya typically explains this idea from a theoretical perspective, which may not be easily understood by everyone.

For instance, during a lecture at UC Berkeley, he explained it as follows:

  • A "Kolmogorov compressor" is a theoretical program that can generate a specific dataset with the shortest length, minimizing regret.

  • Stochastic gradient descent can be seen as searching for an implicit "Kolmogorov compressor" within the weights of a soft computer (such as a large Transformer).

  • The larger the neural network, the better it can approximate the "Kolmogorov compressor," and the lower the regret.

Hinton agrees with this view, and gave a very vivid example in the interview.

If you ask GPT-4 about the similarities between compost and an atomic bomb, most humans wouldn't be able to answer and would think they are two very different things.

GPT-4 would tell you that, although their energy scales and time scales are different, they still have similarities:

  • When compost heats up, the rate at which heat is produced increases.
  • When an atomic bomb produces more neutrons, the rate at which neutrons are produced also increases.

Through analogy, AI understands the concept of a "chain reaction."

Hinton believes that AI uses this understanding to compress all information into its weights.
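The prediction-compression link can be made concrete with a toy experiment. Under arithmetic coding, a symbol the model predicts with probability p costs about -log2(p) bits, so better prediction directly means fewer bits. The sketch below compares a uniform model with a bigram model on an invented string; the models and the corpus are assumptions for illustration, not anything from Ilya's lecture.

```python
import math
from collections import Counter, defaultdict

# A symbol the model assigns probability p costs -log2(p) bits under
# arithmetic coding, so a better predictor yields a shorter encoding.
text = "abracadabra " * 50
alphabet = sorted(set(text))

# Baseline: a uniform model over the alphabet.
uniform_bits = len(text) * -math.log2(1 / len(alphabet))

# Better predictor: a bigram model with add-one smoothing.
counts = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    counts[a][b] += 1

def bigram_prob(a, b):
    total = sum(counts[a].values()) + len(alphabet)
    return (counts[a][b] + 1) / total

first = -math.log2(1 / len(alphabet))  # first character coded uniformly
rest = sum(-math.log2(bigram_prob(a, b)) for a, b in zip(text, text[1:]))
bigram_bits = first + rest

print(f"uniform model: {uniform_bits:.0f} bits")
print(f"bigram model:  {bigram_bits:.0f} bits")  # substantially fewer
```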

What does a good student look like in Hinton's eyes?

Going back to their first meeting, Hinton said that after talking with Ilya for just a short while, you could tell he was very smart.

After talking a bit more, you could see he had great intuition and was good at math.

So, choosing Ilya as a student was a very easy decision to make.

How does he pick other students? Hinton uses the method Ilya was best at: following intuition.

If you try to absorb everything you're told, you end up with a very fuzzy framework; believing everything gets you nowhere.

So a good student, in Hinton's eyes, holds a firm worldview and tries to manipulate incoming facts to fit that framework.

Looking back, both of them held onto this approach, insisting that "big models are not just predicting the next token" and that "prediction is compression, compression is intelligence."

They have both also insisted that the world pay more attention to the risks posed by AI: because of this, one left Google after a decade there, and the other left the OpenAI he helped build.

Full interview with Hinton: https://www.youtube.com/watch?v=tP-4njhyGvo

References:
[1] https://x.com/joelhellermark/status/1791398092400390195
[2] https://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf
[3] https://magazine.utoronto.ca/people/students/ilya-sutskever-google-phd-fellowship/
[4] https://www.utoronto.ca/news/u-t-alum-leading-ai-research-1-billion-non-profit-backed-elon-musk
[5] https://icml.cc/2011/papers/524_icmlpaper.pdf
[6] https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s52092
[7] https://www.youtube.com/watch?v=Yf1o0TQzry8

— End —