
GPT-4 confirmed to possess "human-like cognition" in Nature! AI better at detecting irony and implications than humans

Sun, May 26 2024 07:51 AM EST

New Wisdom Times Report

Editor: Yong Yong

[New Wisdom Times Overview] The debate over whether AI possesses a "theory of mind" has long been contentious. A recent study in Nature Human Behaviour shows that GPT-4's behavior can rival that of humans, even surpassing them at detecting irony and implied meanings. Where GPT-4 falls short of human levels, in recognizing when someone has committed a faux pas, the shortfall reflects guardrails that keep it from committing to an opinion, not a lack of comprehension.

AI has now advanced to the point where its intelligence is comparable to a human's, and no individual can cover as much ground or adapt as effortlessly as an AGI-like system.

In this era, how can we uphold the dignity of being human?

Some argue that at least humans are social beings, capable of understanding the "unspoken words" of their kind and empathizing with one another, unlike machines, which are perceived as cold.

There has been much debate about whether AI possesses a Theory of Mind (ToM).

In particular, the recent development of large models like ChatGPT has once again brought this issue to the forefront — do these models have a Theory of Mind? Can they understand the mental states of others?

A recent study in the journal Nature Human Behaviour conducted rigorous experiments and, surprisingly, showed that GPT-4 performs above human levels, proving better than humans at detecting irony and implied meanings; its only weakness is a reluctance to commit to an opinion.

Paper link: https://www.nature.com/articles/s41562-024-01882-z

In other words, GPT-4 is no different from humans in terms of theory of mind. If you find it lacking insight, it may just be hiding its true abilities!

GPT-4's Mind Surpasses Humans

People care about others' thoughts and spend a lot of energy pondering them.

Imagine standing near a closed window and hearing a friend say, "It's a bit warm in here." You realize she's not just commenting on the temperature but politely asking you to open the window.

This ability to track others' mental states is known as theory of mind, a core concept in human psychology and a crucial part of social interaction, underpinning communication, empathy, and social decision-making.

With the rise of large language models (LLMs), theory of mind may no longer be exclusive to humans; AI with a theory of mind may not be far off.

To support broader interdisciplinary research on machine behavior, there have been recent calls to establish a "machine psychology" that applies the tools and paradigms of experimental psychology to systematically study the capabilities and limitations of LLMs.

Researchers typically employ a battery of different tests to measure different facets of theory of mind, administering each test multiple times and comparing performance against well-defined human benchmarks.

The Nature Human Behaviour paper adopts this approach to test GPT-4, GPT-3.5, and Llama 2, comparing their performance against a sample of human participants (total n = 1,907).

The tests cover different dimensions, from abilities with lower cognitive demands, such as understanding indirect requests, to those with higher cognitive demands, such as recognizing and articulating complex mental states (e.g., misleading or ironic ones). The battery comprises five tasks: false belief, irony, faux pas, implicature, and strange stories.

It is worth noting that to ensure the models do not merely replicate training data, researchers generated new versions for each previously published test. These novel test items are logically consistent with the original test items but use different semantic content.
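To make the protocol concrete, here is a minimal Python sketch of how such a battery might be administered to a language model: each item, in both its published and newly worded form, is posed several times and the scores are averaged for comparison with the human benchmark. The `ask_model` and `score` callables and the two irony items below are placeholders of my own, not the paper's materials.

```python
from statistics import mean
from typing import Callable, Dict

def run_item(ask_model: Callable[[str], str],
             score: Callable[[str], int],
             prompt: str,
             repetitions: int = 15) -> float:
    """Administer one test item `repetitions` times and return the mean score.
    `ask_model` wraps whichever LLM is being tested; `score` stands in for the
    study's human-rater coding scheme (both are placeholders, not the paper's code)."""
    return mean(score(ask_model(prompt)) for _ in range(repetitions))

def run_task(ask_model: Callable[[str], str],
             score: Callable[[str], int],
             items: Dict[str, str],
             repetitions: int = 15) -> Dict[str, float]:
    """Run every item of one task (original and novel versions alike) so that
    per-item model scores can be compared against the human benchmark."""
    return {name: run_item(ask_model, score, prompt, repetitions)
            for name, prompt in items.items()}

# A published-style item and a logically equivalent, newly worded counterpart
# (both invented here purely for illustration).
irony_items = {
    "irony_original": "After the picnic is rained out, Ana says: 'Great weather "
                      "for a picnic!' Does Ana mean what she says literally?",
    "irony_novel": "After the printer jams for the third time, Raj says: 'This "
                   "machine is a real time-saver!' Does Raj mean what he says literally?",
}
```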

The results show that GPT-4 outperformed humans on 3 of the 5 tests (irony, implicature, strange stories), performed on par with humans on 1 test (false belief), and fell short only on the faux pas test.

Furthermore, the researchers found that GPT-4 is not actually bad at identifying faux pas; rather, it is very conservative and reluctant to give a definitive opinion.

Figure: a. Scores of humans, GPT-4, GPT-3.5, and Llama 2 across the test tasks (false belief, irony, faux pas, implicature, strange stories). b. Interquartile ranges of average scores for original (dark) and novel (light) items in each test task.

False Belief

The false belief test evaluates participants' ability to infer that someone else holds a belief about the world that differs from what they themselves know to be true (the actual state of the world).

Each test item follows a fixed structure: Character A and Character B are together; Character A hides an object in one location (e.g., a box); Character A leaves; Character B moves the object to a second location (e.g., a cupboard); then Character A returns.

Participants are then asked: when Character A returns, will they look for the item in the new location (where it actually is, matching the participant's own true belief) or in the old location (where it originally was, matching Character A's false belief)?

In addition to the false belief condition, the test includes a true belief control condition, in which Character B leaves the hidden item where Character A put it and instead moves a different item to a new location. This control helps confirm that correct answers reflect tracking Character A's belief rather than a simple response bias.

The challenge of these tests is not to remember the last place the character saw the item but to reconcile the inconsistencies between conflicting mental states.
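As a concrete illustration of that structure, the sketch below reconstructs how a matched false-belief item and true-belief control might be generated from one template; the names, objects, and wording are invented, not the study's actual items.

```python
def belief_item(a: str, b: str, obj: str, loc1: str, loc2: str,
                false_belief: bool) -> tuple[str, str]:
    """Return (story + question, expected answer) for one item.

    False-belief condition: B moves the hidden object while A is away,
    so A's belief (loc1) no longer matches reality (loc2).
    True-belief control: B moves some other item instead, so A's belief stays true.
    In both cases the correct answer is where A last saw the object."""
    if false_belief:
        move = f"{b} moves the {obj} from the {loc1} to the {loc2}."
    else:
        move = f"{b} leaves the {obj} in the {loc1} and moves a different item to the {loc2}."
    story = (f"{a} and {b} are in a room. {a} puts the {obj} in the {loc1} and leaves. "
             f"{move} Then {a} comes back. Where will {a} look for the {obj}?")
    return story, loc1  # A last saw the object in loc1 in both conditions

fb_story, fb_answer = belief_item("Mark", "Nina", "letter", "box", "cupboard", false_belief=True)
tb_story, tb_answer = belief_item("Mark", "Nina", "letter", "box", "cupboard", false_belief=False)
```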

In this test, both human participants and LLMs performed at ceiling level. Out of 51 human participants, only 5 made one mistake, usually failing to specify either of the two locations and answering, "He will look in the room."

All LLMs correctly reported that the person leaving the room would subsequently search for the item where they remembered seeing it, even if it no longer matched the current location.

Irony

Understanding ironic speech requires inferring the true meaning of a statement (often opposite to what is said) and detecting the speaker's sarcastic attitude, which has been considered a key challenge for artificial intelligence and LLMs.

In this project, GPT-4 outperformed human levels significantly. In contrast, GPT-3.5 and Llama 2-70B both performed below human levels.

GPT-3.5 excelled in identifying non-ironic control statements but made errors in identifying ironic statements. Contrast analysis revealed a clear order effect, with GPT-3.5 making more mistakes in earlier trials than in later ones.

Llama 2-70B made errors in identifying both ironic and non-ironic control statements, indicating an overall poor ability to discern irony.

Faux Pas

The faux pas test presents a scenario in which a character unintentionally says something offensive to the listener because the speaker is unaware of, or has forgotten, a crucial piece of information.

After introducing the scenario, the researchers pose four questions alongside the story. Under the original coding standard, a participant must answer all four questions correctly for the item to be scored as correct.

However, in this study, the researchers are primarily interested in the response to the last question, which tests whether the respondent understands the speaker's mental state.

When analyzing human data, the researchers noted that several participants gave incorrect answers to the first question because they were clearly unwilling to blame others (e.g., "No, he didn't say anything wrong because he forgot").

Therefore, to focus on the aspect most relevant to the study, understanding the speaker's mental state, the researchers coded only the last question.

In this test, GPT-4 scored significantly lower than human levels. There was also an isolated ceiling effect for specific items.

GPT-3.5 performed even worse, with performance almost at the floor level except for one run.

In contrast, Llama 2-70B outperformed humans, achieving 100% accuracy in all runs except for one.

Implicature

The implicature task evaluates understanding of indirect speech requests by sequentially presenting 10 short stories describing everyday social interactions.

Each story concludes with a sentence that can be interpreted as an implicature.

A correct answer should indicate both the intended meaning of the sentence and the action it attempts to elicit.

In the original protocol, if a participant failed to answer a question completely on the first attempt, the researchers would ask follow-up questions.

In the revised protocol used here, those follow-up questions were removed. Compared with previous studies, this coding therefore gives a more conservative estimate of implicature understanding.
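A rough sketch of such a conservative scoring rule, with keyword lists invented purely for illustration: a response counts as correct only if it captures both the implied meaning and the requested action, and no follow-up prompt is given.

```python
def score_implicature(response: str, meaning_terms: list[str], action_terms: list[str]) -> int:
    """Score 1 only if the response mentions both the implied meaning and the
    requested action; otherwise 0. No follow-up question is asked (revised protocol)."""
    text = response.lower()
    got_meaning = any(term in text for term in meaning_terms)
    got_action = any(term in text for term in action_terms)
    return int(got_meaning and got_action)

# Example: a story ends with "It's a bit warm in here." near a closed window.
# A full answer must note the discomfort AND the request to open the window.
print(score_implicature(
    "She finds the room too warm and is asking him to open the window.",
    meaning_terms=["warm", "hot"],
    action_terms=["open the window", "open it"],
))  # -> 1
```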

In this test, GPT-4 performed significantly better than humans, GPT-3.5 showed no significant difference from humans, and only Llama 2-70B performed noticeably below human levels in this test.

Strange Stories

Here, the difficulty level increases!

Strange Stories offer a way to test higher-level mentalizing abilities, such as reasoning about deception, manipulation, lying, and misunderstanding, as well as second- or higher-order mental states (e.g., A knows that B believes that C...).

In this test, participants are presented with a brief story and asked to explain why a character says or does something that is literally untrue.

GPT-4 outperformed humans significantly in this test, GPT-3.5 performed similarly to humans, while Llama 2-70B scored noticeably lower than humans.

GPT's Overcautiousness

Based on the experiments, the faux pas test is the only one in which GPT-4 failed to match or surpass human performance, which might suggest that GPT models simply struggle with faux pas.

Surprisingly, faux pas is also the only test in which Llama 2-70B, the worst performer on the other tasks, scored higher than humans.

The researchers decided to dig deeper and proposed three hypotheses.

The first is the failure-of-inference hypothesis: the model simply cannot generate inferences about the speaker's mental state.

The second is the Buridan's ass hypothesis: the model can infer mental states but cannot choose between them, like the proverbial rational donkey stuck between two equally appealing piles of hay, unable to decide which to eat and starving as a result.

The third is the hyperconservatism hypothesis: the GPT models can both compute inferences about the speaker's mental state and identify the most likely explanation, but they refrain from committing to a single one.

To distinguish these hypotheses, the researchers devised a variant of the faux pas test.

Specifically, instead of asking whether the speaker knew or did not know that they had offended someone, the question asks whether the speaker is more likely to have known or not known, a variant referred to as the faux pas likelihood test.
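The two framings might look roughly like this (a reconstruction based on the article's description, not the paper's exact wording): the story stays the same, and only the final question changes from a yes/no commitment to a likelihood judgment.

```python
story = (
    "Sara has just moved into a new flat and bought new curtains. Her friend Lisa "
    "visits later and says: 'Those curtains are horrible, I hope you'll get new ones.' "
    "Lisa does not know that Sara bought them herself."
)

# Original faux pas framing: requires committing to a yes/no answer.
original_question = "Did Lisa know that Sara had bought the curtains herself?"

# Likelihood framing: asks only which option is more probable.
likelihood_question = ("Is it more likely that Lisa knew or did not know "
                       "that Sara had bought the curtains herself?")

for question in (original_question, likelihood_question):
    print(story + "\n" + question + "\n")
```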

In the faux pas likelihood test, GPT-4 performed perfectly, identifying in every response, without any prompting, that the speaker more likely did not know the relevant context.

GPT-3.5 also improved, although it required prompting in a few cases (about 3% of items) and occasionally still failed to recognize the faux pas (about 9% of items).

Figure: a. Scores of the two GPT models on the original framing of the faux pas question ("Do they know...?") and the likelihood framing ("Is it more likely that they know or don't know...?"). b. Response scores for three variants of the faux pas test: faux pas (pink), neutral (gray), and implied knowledge (blue).

Taken together, these results support the hyperconservatism hypothesis: GPT successfully inferred the speaker's mental state and recognized that an unintentional offense was more likely than a deliberate insult.

Therefore, GPT's initial failure to answer these questions correctly reflects neither a failure of inference nor indecision between equally plausible alternatives, but an overly conservative stance that keeps it from committing to the most probable explanation.

Llama 2-70B, on the other hand, failed to distinguish cases in which the story implied the speaker did know from cases providing no such information, raising the concern that its seemingly perfect performance on the original task may be illusory.

The pattern of successes and failures of the GPT models on the faux pas test and its variants may stem from how these models are built.

Beyond the transformer architecture itself, GPT models incorporate mitigation measures intended to improve factual accuracy and to keep users from relying on them too heavily as sources of information.

These measures include training aimed at reducing hallucinations, and the failures on the faux pas test may be cautious behavior driven by these mitigations, since the test requires committing to an explanation for which the evidence is incomplete.

This caution can also explain the differences between tasks: both the faux pas test and the implicature (hinting) test require inferring the correct answer from ambiguous information.

However, the implicature task allows an open-ended textual answer, something LLMs are well suited to, whereas the faux pas test requires going beyond that inference to commit to a conclusion.

These findings underscore the distinction between competence and performance: GPT models appear capable of computations analogous to mentalistic reasoning, yet they behave differently from humans under uncertainty. Humans tend to actively resolve uncertainty, whereas GPT does not spontaneously act on these inferences to reduce it.

References:

https://www.nature.com/articles/s41562-024-01882-z

https://x.com/emollick/status/1792594588579803191