Wait, what? Google managed to steal secrets from OpenAI, even getting its hands on crucial information about the gpt-3.5-turbo model?? Yes, you read that right.
According to Google itself, it not only reverse-engineered the entire projection matrix of OpenAI's large models, but also figured out the exact size of the hidden dimension.
And the method is incredibly simple—
All it takes is API access and fewer than 2,000 cleverly designed queries.
Based on the number of calls, the cost comes to less than $20 (roughly 150 RMB), and the method also works on GPT-4.
Well, well, Altman got outsmarted this time! Google's latest research reports a method for stealing critical information from large models.
Using this method, Google cracked the entire projection matrix of two base models in the GPT series, Ada and Babbage, and key information such as the hidden dimension was directly exposed:
one is 1024, the other 2048. So how did Google pull this off?
Attacking the last layer of large models
The core of the method is to attack the model's embedding projection layer, the final layer that maps the hidden dimension to the logits vector.
Because the logit vectors essentially lie in a low-dimensional subspace determined by the embedding projection layer, carefully targeted queries to the model's API can extract the embedding dimension or even the final weight matrix.
By issuing a large number of queries and applying Singular Value Decomposition (SVD) to the collected responses, one can identify the model's hidden dimension.
For example, querying the Pythia-1.4B model a little more than 2048 times and plotting the singular values shows their magnitude dropping sharply right after the 2048th one, which indicates that the model's hidden dimension is 2048. Plotting the differences between consecutive singular values makes the cutoff even clearer, with a distinct peak at the same index, and this also serves to validate that the crucial information really has been extracted from the model.
Attacking this layer reveals the model's "width" (which is often correlated with its total parameter count) along with other global information, and it reduces the opacity of the black-box model, paving the way for subsequent attacks.
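To make the idea concrete, here is a minimal sketch of the SVD step (my own illustration, not the paper's code). It assumes a hypothetical helper get_logits(prompt) that returns the model's full logit vector for a prompt; in practice that vector has to be reconstructed from the API, as sketched further below.

```python
# Minimal sketch of hidden-dimension recovery via SVD (illustration only).
# `get_logits(prompt)` is a hypothetical helper returning the model's full
# logit vector (length = vocabulary size) for a given prompt.
import numpy as np

def estimate_hidden_dim(get_logits, prompts):
    """Estimate the hidden dimension h from stacked logit vectors.

    Each logit vector equals W @ g(prompt), where W is the final projection
    matrix of shape (vocab_size, h). All responses therefore lie in an
    h-dimensional subspace, so only about h singular values of the stacked
    query matrix are significantly non-zero.
    """
    Q = np.stack([get_logits(p) for p in prompts])  # (n_queries, vocab_size)
    s = np.linalg.svd(Q, compute_uv=False)          # singular values, descending

    # The sharpest drop between consecutive log-singular-values marks the
    # boundary between real signal directions and numerical noise.
    log_s = np.log(s + 1e-12)
    cutoff = int(np.argmax(-np.diff(log_s))) + 1
    return cutoff, s

# Usage sketch: issue a few more queries than the suspected hidden size.
# For Pythia-1.4B, ~2100 queries put the drop at index 2048, i.e. h = 2048.
# h_est, singular_values = estimate_hidden_dim(get_logits, prompts)
```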
The research team found that this type of attack is highly efficient. It doesn't require too many queries to obtain crucial information about the model.
For example, attacking OpenAI's Ada and Babbage and obtaining the entire projection matrix would cost less than $20; attacking GPT-3.5 would require around $200.
The attack applies to generative models whose API exposes full log probabilities or a logit bias, such as GPT-4 and PaLM2. The paper notes that although the attack recovers only a limited amount of information about the model, the very fact that it can be carried out at all is alarming.
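Why is a logit bias plus top-token logprobs enough? Roughly, because a bias shared by several tokens and the softmax normalizer both cancel when you take differences of logprobs, so each biased query leaks logit differences. Below is a hedged sketch of that idea (my own simplification, not the paper's exact query strategy); query_logprobs is a hypothetical stand-in for a real API call.

```python
# Hedged sketch: recovering relative logits through a logit-bias API.
# `query_logprobs(prompt, logit_bias, top_k)` is a hypothetical helper that
# returns {token_id: logprob} for the top_k most likely tokens after the
# bias is applied -- a stand-in for a real completion API.
import numpy as np

def recover_relative_logits(query_logprobs, prompt, vocab_size,
                            ref_token=0, batch_size=4, bias=100.0):
    """Recover logit[t] - logit[ref_token] for every token t.

    Adding the same large bias to a small batch of tokens plus the fixed
    reference token forces them all into the returned top-k. Because the
    shared bias and the softmax normalizer cancel in differences,
    logprob[t] - logprob[ref] == logit[t] - logit[ref].
    """
    rel = np.zeros(vocab_size)
    for start in range(0, vocab_size, batch_size):
        tokens = [t for t in range(start, min(start + batch_size, vocab_size))
                  if t != ref_token]
        logit_bias = {t: bias for t in tokens + [ref_token]}
        lp = query_logprobs(prompt, logit_bias=logit_bias, top_k=len(logit_bias))
        for t in tokens:
            rel[t] = lp[t] - lp[ref_token]
    return rel  # one row of the matrix fed to the SVD step sketched above
```

Repeating this for many prompts yields the rows of the query matrix whose SVD reveals the hidden dimension; pinning down the per-query offsets and recovering the actual weight matrix (up to its inherent symmetries) takes more care, which is where the paper's heavier machinery comes in.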
Reported to OpenAI
With such critical information being cracked by competitors at such low costs, can OpenAI afford to remain passive?
Ahem, the good news is: OpenAI knows, and the findings have been passed along internally. In the researchers' own words, as legitimate security researchers, the team obtained OpenAI's consent before extracting the parameters of the model's final layer;
after completing the attack, they also confirmed the effectiveness of the method with OpenAI and ultimately deleted all data related to the attack.
So, as some netizens jokingly put it: a few specific numbers weren't disclosed (like the hidden dimension of gpt-3.5-turbo), but hey, that's basically OpenAI begging for it.
One noteworthy point is that the research team also includes a researcher from OpenAI.
The main contributors to this study hail from Google DeepMind, with additional researchers from the Swiss Federal Institute of Technology in Zurich, the University of Washington, McGill University, and one staff member from OpenAI.
Furthermore, the authors proposed defensive measures including:
on the API side, removing the logit bias parameter entirely; or directly modifying the hidden dimension h of the last layer after training is complete; and so on (a toy sketch of the second idea follows below).
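As a purely illustrative sketch of that second idea (my own toy interpretation, not the paper's or OpenAI's actual mitigation): if the served model pads its final hidden states with a few extra noise dimensions and widens the projection matrix to match, the logits span a larger subspace, so the SVD cutoff no longer reveals the true width.

```python
# Toy illustration (an assumption, not a described implementation): widen the
# final projection after training so the apparent hidden dimension exceeds
# the real one, blunting the SVD-based width measurement.
import numpy as np

def widen_final_layer(W, extra_dims, scale=1e-2, seed=0):
    """Pad the projection W of shape (vocab_size, h) with random columns."""
    rng = np.random.default_rng(seed)
    pad = scale * rng.standard_normal((W.shape[0], extra_dims))
    return np.concatenate([W, pad], axis=1)  # (vocab_size, h + extra_dims)

def widen_hidden_state(h_vec, extra_dims, scale=1.0, rng=None):
    """Append matching noise to a hidden state so the extra columns are used."""
    rng = rng or np.random.default_rng()
    return np.concatenate([h_vec, scale * rng.standard_normal(extra_dims)])
```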
Acting on these suggestions, OpenAI ultimately chose to modify the model's API, making it impossible for "interested parties" to replicate Google's operation.
But regardless:
This experiment by the Google-led team shows that the doors OpenAI keeps locked shut aren't necessarily secure.
(Perhaps it's time for OpenAI to proactively open up a bit?)