A test of artificial intelligence

As debate rages over the abilities of modern AI systems, scientists are still struggling to effectively assess machine intelligence.

By Michael Eisenstein

Judged as an artwork, GPT-4’s unicorn won’t win any prizes. The assortment of geometric shapes produced by the deep-learning algorithm only loosely captures the appearance of the majestic mythical beast. But when Sébastien Bubeck, a machine-learning specialist at Microsoft Research in Redmond, Washington, looks at the image, he sees something striking: a demonstration of the program’s ability to apply rudimentary reasoning.

In work detailed in a 2023 preprint1, his team subjected GPT-4 — a creation of artificial intelligence (AI) company OpenAI — to dozens of diverse challenges, including writing proofs of mathematical theorems in the style of Shakespeare and navigating a virtual map using written instructions. One test tasked the algorithm with generating code that could draw a unicorn. The researchers then heavily modified the code, erasing the unicorn’s horn in the process, and asked GPT-4 to revise the new code to produce an intact unicorn. Bubeck thinks that this required GPT-4 to have some concept of the animal’s anatomy. It had to identify the ‘head’ element of this newly generated code, determine how the horn should be situated and oriented, and then update the code accordingly — which it did.
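To make the exercise concrete, the sketch below loosely re-imagines the task in Python with matplotlib; the preprint's unicorn was drawn with code in LaTeX's TikZ language, and nothing below is GPT-4's own output. The point is simply that 'drawing a unicorn' here means composing geometric primitives, and that restoring a deleted horn requires locating the head among them.

```python
# A loose Python/matplotlib re-imagining of the unicorn task (the preprint used
# TikZ): a 'unicorn' assembled from geometric primitives.
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots()

# Body and legs.
ax.add_patch(patches.Ellipse((0.50, 0.45), 0.40, 0.22, facecolor="white", edgecolor="black"))
for x in (0.38, 0.48, 0.58, 0.68):
    ax.add_patch(patches.Rectangle((x, 0.18), 0.04, 0.22, facecolor="lightgrey", edgecolor="black"))

# Head: the element GPT-4 had to locate in order to reattach the horn.
ax.add_patch(patches.Circle((0.76, 0.62), 0.08, facecolor="white", edgecolor="black"))

# Horn: in the experiment this element was deleted from the code, and the model
# was asked to restore it, i.e. to place a triangle on top of the head.
ax.add_patch(patches.Polygon([(0.72, 0.69), (0.80, 0.69), (0.76, 0.86)],
                             facecolor="gold", edgecolor="black"))

ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.set_aspect("equal"); ax.axis("off")
plt.savefig("unicorn.png")
```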

“To me, this is undeniable evidence that reasoning is happening in those systems,” he says. The preprint describing his team’s extensive evaluation was provocatively titled Sparks of artificial general intelligence: early experiments with GPT-4.

A rudimentary image of a unicorn constructed from geometric shapes

GPT-4’s attempt at drawing a unicorn is passable, but does it demonstrate that the algorithm can reason? Credit: arXiv: 2303.12712, CC BY 4.0.

Artificial general intelligence is a loaded term. The idea is intended as a contrast to AI systems designed for specific tasks, such as playing chess or solving protein structures. Artificial general intelligence could instead be directed at a broad range of problems. GPT-4 and other contemporary systems based on large language models (LLMs), such as Google’s Bard or ERNIE Bot by Chinese technology company Baidu, certainly fit that part of the bill. These algorithms are based on an approach known as deep learning, which means that the software uses neural-network architecture that imitates key features of the way in which the brain processes and responds to information. After being fed tremendous amounts of training data, deep learning makes it possible to identify connections in the data. When given a text prompt, these models can then use these connections to formulate an appropriate response.
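In caricature, that prompt-to-response loop looks like the sketch below, in which a hypothetical `next_token_distribution` function stands in for the trained network: the model repeatedly asks what plausibly comes next given everything so far, samples a word, appends it and repeats. Real LLMs operate over subword tokens and billions of learned parameters, but the control flow has this shape.

```python
import random

def next_token_distribution(context):
    # Stand-in for the trained neural network. In a real LLM this would be a
    # forward pass over billions of learned parameters; here it is a small,
    # hard-coded table keyed on the most recent word.
    table = {
        "the":     {"unicorn": 0.6, "test": 0.4},
        "unicorn": {"is": 0.7, "has": 0.3},
        "test":    {"is": 1.0},
        "is":      {"hard.": 0.5, "mythical.": 0.5},
        "has":     {"a": 1.0},
        "a":       {"horn.": 1.0},
    }
    return table.get(context[-1], {"<end>": 1.0})

def generate(prompt, max_tokens=10):
    # Autoregressive sampling: each new word is drawn from a distribution
    # conditioned on everything produced so far, then appended to the context.
    tokens = prompt.split()
    for _ in range(max_tokens):
        dist = next_token_distribution(tokens)
        token = random.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<end>":
            break
        tokens.append(token)
    return " ".join(tokens)

print(generate("the"))  # e.g. "the unicorn has a horn."
```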

Modern LLMs have demonstrated the ability to write software code, compose fictional vignettes, solve riddles and converse in a manner that often seems eerily human. But general intelligence also implies a capacity for reasoning, abstraction and understanding — a controversial assertion.

To their credit, GPT-4 and its peers have gone far beyond the abilities of many previous chatbots. AI researcher Samuel Bowman at New York University says that some work with LLMs suggests that these algorithms are uncovering deeper levels of meaning in text. “You’re seeing glimmers of evidence that these models are representing objects or facts or people in the world in a way that’s pretty far abstracted away from the language used to talk about them,” he says. He points to examples in which LLMs have been able to internally organize information based on text inputs. “If you tell a model a story, it’ll map out the physical space the story takes place in and note connections,” says Bowman.

But the hype has also spurred scepticism. An influential 2021 paper2 described LLMs as “stochastic parrots” that stitch together language-mimicking responses in a probabilistic fashion, without possessing true understanding. From this perspective, GPT-4’s apparent achievements could simply be a product of the breadth of connections that can be formulated from its vast training data set, which is thought to include much — and perhaps even most — of the Internet.
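The most extreme version of that critique is easy to build. A toy bigram 'parrot', sketched below with made-up training text, produces passable-looking sentences purely from word-to-word co-occurrence statistics, with nothing that could be called understanding; the debate is over whether LLMs, for all their scale, are doing something qualitatively different.

```python
import random
from collections import defaultdict

# A toy 'stochastic parrot': it records only which word follows which in the
# training text, then stitches sentences together probabilistically.
corpus = (
    "the unicorn has a horn . the unicorn is a mythical beast . "
    "the beast has a head and a horn ."
).split()

bigrams = defaultdict(list)
for first, second in zip(corpus, corpus[1:]):
    bigrams[first].append(second)

def parrot(word="the", length=12):
    out = [word]
    for _ in range(length):
        word = random.choice(bigrams[word])
        out.append(word)
        if word == ".":
            break
    return " ".join(out)

print(parrot())  # e.g. "the beast has a horn ."
```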

Many researchers hesitate to apply the stochastic-parrot label to algorithms such as GPT-4, but also recognize the stark limits on LLMs’ cognitive capabilities. AI responses to prompts are often baffling or even risible, with the computer manufacturing false facts or forging incorrect associations. “If 80% of the time it does the right thing, but 20% of the time it does something bizarre, I don’t know if I’m comfortable calling that ‘knowing’ or ‘understanding’,” says Ellie Pavlick, a computer scientist at Brown University in Providence, Rhode Island. In August, a team of computer scientists at Purdue University in West Lafayette, Indiana, found that more than half of ChatGPT’s answers to questions on Stack Overflow, a website for programmers, contained inaccuracies3.

Resolving the debate over LLM intelligence will probably require the same treatment that has settled countless other scientific conundrums: rigorous testing. “In assessing intelligence, it’s really important to carefully and systematically test things,” says Melanie Mitchell, who studies complex systems at the Santa Fe Institute in New Mexico. Unfortunately, the challenge of benchmarking the intellectual capabilities of AI systems relative to those of the people who create them has bedevilled computer scientists since the earliest days of computing, and the field is still struggling with it today.

Although the Microsoft team’s investigation of GPT-4 yielded intriguing results, Mitchell does not think that the researchers provided concrete evidence of intelligence. “The experiments that they did and reported on are not reproducible,” she says, noting that it remains unclear precisely which version of GPT-4 was tested or how that version was trained. Numerous attempts have been made to craft reliable, reproducible tests of machine intelligence, but it is difficult to design tasks for which the algorithm being assessed has not already seen the answers, and that cannot be defeated by the system taking shortcuts. AI systems are rapidly becoming more capable, and calls for regulation are growing. The need for rigorous benchmarking that provides meaningful conclusions, therefore, has never been greater.

Failing the test of time

Nearly 75 years ago, British mathematician and computing pioneer Alan Turing proposed a now-famous philosophical exercise for assessing whether machines can think4. In brief, Turing’s ‘imitation game’ involves a conversation between two agents: each either a person or a machine pretending to be a person. A human judge must discern which interlocutor is the machine.

Turing’s test remains a foundational concept in AI assessment, but it is ultimately a poor method of evaluation. “That criterion was passed decades ago with really dumb chatbots that no one would consider intelligent,” says Mitchell. “Just interacting with a program and getting impressions of it is clearly not a robust way to evaluate it.”

Just 16 years after the publication of Turing’s paper, German-American computer scientist Joseph Weizenbaum developed ELIZA, a chat program that could ape a human psychoanalyst through a combination of open-ended questions and keyword-driven automated responses. Although the system was simplistic, some users were still wowed by ELIZA’s capacity for conversation and would talk to the program as though it were human.

“We empathize and assign agency to everything that is around us,” says José Hernández-Orallo, an AI researcher at the Polytechnic University of Valencia in Spain. People will assign personality to all types of machine, but Hernández-Orallo says that this readiness to anthropomorphize is particularly heightened in systems designed to mimic human communication.

Today, the ELIZA effect is a well-known problem. Consider the 2022 incident in which former Google engineer Blake Lemoine said that conversations with the company’s LaMDA AI — the foundation of its Bard chatbot — persuaded him that it was sentient. In an interview in The Washington Post in June that year, Lemoine said that if he wasn’t part of the team developing LaMDA, “I’d think it was a 7-year-old, 8-year-old kid that happens to know physics”. 

Other attempts at benchmarking machine intelligence have also fallen by the wayside. Many have relied on tests of linguistic comprehension and interpretation that can be scored objectively, rather than requiring the subjective assessment of a person. The Winograd Schema Challenge, for example, was a popular test of reasoning developed5 by Hector Levesque at the University of Toronto, Canada, and his colleagues. The challenge questions are designed to test the ability of a machine to resolve ambiguity in sentences that people generally find straightforward to interpret. For example, after providing the AI with the sentence “Paul tried to call George on the phone, but he wasn’t available,” the AI would be asked “Who is not available?” In less than a decade, machines could match or exceed human scores on the test.
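The structure of such items is simple enough to write down directly. The hypothetical sketch below represents a schema pair modelled on the example above (the second sentence swaps one word so that the answer flips) and scores any system that exposes a pronoun-resolution function; it illustrates the test format, not the official challenge data.

```python
from dataclasses import dataclass

@dataclass
class WinogradItem:
    sentence: str        # contains an ambiguous pronoun
    pronoun: str
    candidates: tuple    # the two possible referents
    answer: str          # the referent a person would choose

# Modelled on the example in the text: swapping one word ("available" for
# "successful") flips which referent the pronoun picks out.
items = [
    WinogradItem("Paul tried to call George on the phone, but he wasn't available.",
                 "he", ("Paul", "George"), "George"),
    WinogradItem("Paul tried to call George on the phone, but he wasn't successful.",
                 "he", ("Paul", "George"), "Paul"),
]

def evaluate(resolve_pronoun, items):
    # Score a system's pronoun-resolution function against the intended answers.
    correct = sum(resolve_pronoun(it.sentence, it.pronoun, it.candidates) == it.answer
                  for it in items)
    return correct / len(items)

# A baseline that always picks the first-mentioned candidate scores only 50%
# on a balanced pair, i.e. chance level.
print(evaluate(lambda sent, pron, cands: cands[0], items))
```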

As an alternative to individual testing approaches, in 2018 Bowman and his colleagues developed6 a multipronged strategy for evaluating language comprehension in machines, called General Language Understanding Evaluation (GLUE). This battery of tests includes the Winograd Schema Challenge as well as other benchmarks of what linguists call natural-language interpretation. AI systems soon exceeded human performance on GLUE, so the researchers developed7 a harder iteration called SuperGLUE — which some systems also quickly surpassed. “Models were just passing these tests at human level basically as fast as we came up with them,” says Bowman.
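Conceptually, a GLUE-style battery is just that kind of scoring applied across many labelled tasks and averaged into a headline number. The sketch below uses invented examples and a placeholder `predict` function to show the shape of such an evaluation harness; the real benchmarks define their own tasks, data splits and metrics.

```python
# Schematic of a GLUE-style battery: several labelled tasks, one headline score.
# The task data and `predict` function are hypothetical stand-ins.
tasks = {
    "paraphrase": [("A man is cooking.", "Someone prepares food.", 1),
                   ("A man is cooking.", "A dog is barking.", 0)],
    "entailment": [("All unicorns have horns. Bob is a unicorn.", "Bob has a horn.", 1),
                   ("Bob is a unicorn.", "Bob is a horse.", 0)],
}

def predict(task_name, text_a, text_b):
    # Hypothetical model interface; a real evaluation would query the LLM here.
    return 1

def run_battery(tasks, predict):
    scores = {}
    for name, examples in tasks.items():
        correct = sum(predict(name, a, b) == label for a, b, label in examples)
        scores[name] = correct / len(examples)
    scores["average"] = sum(scores.values()) / len(scores)
    return scores

print(run_battery(tasks, predict))
# {'paraphrase': 0.5, 'entailment': 0.5, 'average': 0.5}
```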

One key issue was that once the tests associated with GLUE and SuperGLUE were defined, AIs could be intensively trained to keep bumping up their scores — essentially, researchers were teaching to the test. Bowman notes that some tests proved more challenging to beat. In some, machines are tested on their understanding of whole paragraphs of text from a range of sources, including magazine articles and transcribed phone calls. This initially stumped systems that could defeat simpler tests involving isolated sentences, but Bowman says that “within another generation — even before GPT-3 was fully released — we were seeing good performance even on those.” 

These are far from the only tests devised so far, but the same general pattern persists. Gary Marcus, an AI researcher at New York University, thinks that a big part of the problem is the way that benchmarks have been designed. “It’s easy to make a reliable test, but it’s very hard to make a test that is a valid measure of the thing you want,” he says. Training an AI on harder problems for a particular test can yield better performance, but if the test is not a meaningful measure of cognitive capacity, its defeat will not be particularly informative.

This is especially true given the opportunities that AI has to cheat, simply by virtue of having a much bigger ‘brain’ than any human. “A machine can pick up on very subtle statistical associations in the language that humans would never pick up on,” says Mitchell. These shortcuts can lead an AI to the correct answer in a test without any true understanding.
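A deliberately crude example shows how far a shortcut can go. The hypothetical 'model' below never interprets the question at all; it keys on a single surface cue that happens to correlate with the answer in a badly constructed test set, and still scores well above chance.

```python
# A crude shortcut learner: it never reads the question, it only checks for a
# surface cue ("not") that correlates with the label in this hypothetical set.
test_set = [
    ("The vase did not survive the fall. Is the vase intact?", "no"),
    ("The window was not broken by the storm. Is the window intact?", "yes"),  # cue fails here
    ("The mirror did not withstand the impact. Is the mirror intact?", "no"),
    ("The bowl did not remain in one piece. Is the bowl intact?", "no"),
    ("The jug survived the fall without a scratch. Is the jug intact?", "yes"),
]

def shortcut_model(question):
    # No comprehension at all: negation in the text is simply taken to mean "no".
    return "no" if " not " in question else "yes"

correct = sum(shortcut_model(q) == gold for q, gold in test_set)
print(f"accuracy: {correct}/{len(test_set)}")  # 4/5 right, with no understanding
```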

"It’s very hard to make a test that is a valid measure of the thing you want"

A better benchmark

The search continues for more-sophisticated and evidence-based approaches for evaluating AI. One solution is to go broad and test as many parameters as possible — similar to the Microsoft team’s approach with GPT-4, but in a more systematic and reproducible fashion.

A team of Google researchers spearheaded one such effort in 2022 with its Beyond the Imitation Game benchmark (BIG-Bench) initiative8, which brought together scientists from around the world to assemble a battery of around 200 tests grounded in disciplines such as mathematics, linguistics and psychology.

The idea is that a more diverse approach to benchmarking against human cognition will lead to a richer and more meaningful indicator of whether an AI can reason or understand, at least in some areas, even if it falls short in others. Google’s PaLM algorithm, however, was already able to beat humans at nearly two-thirds of the BIG-Bench tests at the time of the framework’s release.

The approach taken by BIG-Bench could be confounded by a number of issues. One is data-set pollution. With an LLM that has been potentially exposed to the full universe of scientific and medical knowledge on the Internet, it becomes exceedingly difficult to ensure that the AI has not been ‘pre-trained’ to solve a given test or even just something resembling it. Hernández-Orallo, who collaborated with the BIG-Bench team, points out that for many of the most advanced AI systems — including GPT-4 — the research community has no clear sense of what data were included or excluded from the training process.
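Where the training corpus is accessible, one partial safeguard is an overlap audit: flag any test item that shares long word sequences with the training data. The sketch below illustrates the idea with a tiny in-memory corpus and an eight-word fingerprint; real contamination checks run at vastly larger scale, but the n-gram principle is the same.

```python
def ngrams(text, n=8):
    # Set of n-word sequences, used as a crude fingerprint of a passage.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(test_items, training_corpus, n=8):
    # Return the test items that share any n-gram with the training corpus,
    # i.e. items the model may effectively have seen the answer to.
    corpus_grams = ngrams(training_corpus, n)
    return [item for item in test_items if ngrams(item, n) & corpus_grams]

# Hypothetical example: the first test question appears verbatim in the training data.
training_corpus = "... Paul tried to call George on the phone, but he wasn't available ..."
test_items = [
    "Paul tried to call George on the phone, but he wasn't available. Who was unavailable?",
    "The trophy doesn't fit in the brown suitcase because it is too small. What is too small?",
]
print(flag_contaminated(test_items, training_corpus))
```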

This is problematic because the most robust and well-validated assessment tools, developed in fields such as cognitive science and developmental psychology, are thoroughly documented in the literature, and therefore would probably have been available to the AI. No person could hope to consistently defeat even a stochastic parrot armed with vast knowledge of the tests. “You have to be super-creative and come up with tests that look unlike anything on the Internet,” says Bowman. And even then, he adds, it’s wise to “take everything with a grain of salt”.

Lucy Cheke, a comparative psychologist who studies AI at the University of Cambridge, UK, is also concerned that many of these test batteries are not able to properly assess intelligence. Tests that are designed to evaluate reasoning and cognition, she explains, are generally designed for the assessment of human adults, and might not be well suited for evaluating a broader range of signatures of intelligent behaviour. “I’d be looking to the psycholinguistics literature, at what sorts of tests we use for language development in children, linguistic command understanding in dogs and parrots, or people with different kinds of brain damage that affects language.”

Cheke is now drawing on her expertise in studying animal behaviour and developmental psychology to develop animal-inspired tests in collaboration with Hernández-Orallo, as part of the RECOG-AI study funded by the US Defense Advanced Research Projects Agency. These go well beyond language to assess intelligence-associated common-sense principles such as object permanence — the recognition that something continues to exist even if it disappears from view.
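Object permanence is simple enough to probe programmatically: show an agent an object, hide it, and ask where it went. The toy, text-based sketch below illustrates the logic with a hypothetical agent interface; the RECOG-AI tests themselves use far richer simulated environments.

```python
import random

def object_permanence_trial(agent):
    # One toy trial: an object is placed in one of three boxes while the agent
    # 'watches', a screen then hides the boxes, and the agent is asked where
    # the object is. Without object permanence, odds are no better than 1/3.
    boxes = ["left", "middle", "right"]
    location = random.choice(boxes)
    agent.observe(f"A ball is placed in the {location} box.")
    agent.observe("A screen is lowered; the boxes are no longer visible.")
    return agent.answer("Where is the ball?") == location

class RememberingAgent:
    # A minimal agent that passes simply by retaining its last sighting of the
    # ball; it stands in for whatever system is being tested.
    def __init__(self):
        self.ball_location = None
    def observe(self, event):
        for box in ("left", "middle", "right"):
            if f"the {box} box" in event and "ball" in event:
                self.ball_location = box
    def answer(self, question):
        return self.ball_location

trials = [object_permanence_trial(RememberingAgent()) for _ in range(100)]
print(f"passed {sum(trials)}/100 trials")
```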

Tests designed to evaluate animal behaviour could be used to assess AI systems. In this video, AI agents and various animal species attempt to retrieve food from inside a transparent cylinder. Credit: AI videos, Matthew Crosby; animal videos, MacLean, E. L. et al. Proc. Natl Acad. Sci. USA 111, E2140-E2148 (2014).

As an alternative to conventional benchmarks, Pavlick is taking a process-oriented approach that allows her team to essentially check an algorithm’s homework and understand how it arrived at its answer, rather than evaluating the answer in isolation. This can be especially helpful when researchers lack a clear view of the detailed inner workings of an AI algorithm. “Having transparency about what happened under the hood is important,” says Pavlick.

When transparency is lacking, as is the case with today’s corporate-developed LLMs, efforts to assess the capabilities of an AI system are made more difficult. For example, some researchers report that current iterations of GPT-4 differ considerably in their performance from previous versions — including those described in the literature — making apples-to-apples comparison almost impossible. “I think that the current corporate practice of large language models is a disaster for science,” says Marcus.

But there are workarounds that make it possible to establish more rigorously controlled exam conditions for existing tests. For example, some researchers are generating simpler, ‘mini-me’ versions of GPT-4 that replicate its computational architecture but with smaller, carefully defined training data sets. If researchers have a specific battery of tests lined up to assess their AI, they can selectively curate and exclude training data that might give the algorithm a cheat sheet and confound testing. “It might be that once we can spell out how something is happening on a small model, you can start to imagine how the bigger models are working,” says Pavlick.

Safety first

Even if they demonstrate intriguing signs of sophistication on certain tests, today’s LLMs have inherent limitations. GPT-4, for example, has no long-term memory, and lacks the ability to plan ahead before generating an output. Bubeck notes that GPT-4 was unable to solve a Sudoku puzzle, and his research enumerates several other areas in which the algorithm consistently stumbled or failed. In a preprint9 in August, AI researcher Konstantine Arkoudas at biotechnology firm Dyania Health in Jersey City, New Jersey, bluntly argues that GPT-4 is “utterly incapable of reasoning”, based on its inadequate performance in 21 challenges.

How intelligent purely LLM-based systems can become remains an open question. Ever-larger training data sets and more-sophisticated architectures might yield more-robust reasoning capabilities and other features of cognition. “Models get more capable with scale,” says Pavlick. She has been impressed with the progress LLMs have made in mastering challenging tasks such as syntax. “For a while, that was one of those holy grails of abstract reasoning,” she says.

But Marcus is sceptical of how much further this framework can be pushed. “Deep learning is hitting a wall,” he says. “We need new ideas.” These fresh strategies will lead to further revisiting of assessment and benchmarking, just as rapidly improving LLMs have required.

Evaluating cognitive and reasoning capabilities will also almost certainly be a core component of efforts to implement national or global regulation of AI systems. Calls for such oversight have been steadily growing both in and outside the field. Some people cite the desire to stave off the perceived threat of an artificial super-intelligence, but many more are principally concerned with the more fundamental danger of pushing an already powerful but error- and abuse-prone technology into mainstream use.

“I’m not in the camp of people who think we’re looking at an existential crisis in the medium term,” says Pavlick. “But I don’t like the idea of deploying things en masse that we don’t understand.” A well-validated battery of assessment tools could reveal the frontiers of an AI system’s cognitive capacity, and how far it can be pushed before it starts to produce incorrect or otherwise harmful results. “I’m very interested in this idea of standard-setting and auditing organizations,” says Bowman. However, he also cautions that there will inevitably be surprises with a system as complex and poorly understood as an AI algorithm. “I think the thing that is hardest to say is, ‘I’m confident the model won’t do X,’” says Bowman.

doi: https://doi.org/10.1038/d41586-023-02822-z

References

  1. Bubeck, S. et al. Preprint at https://doi.org/10.48550/arXiv.2303.12712 (2023).
  2. Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. In: Proc. ACM Conference on Fairness, Accountability, and Transparency (FAccT’21) 610–623 (2021).
  3. Kabir, S., Udo-Imeh, D. N., Kou, B. & Zhang, T. Preprint at https://doi.org/10.48550/arXiv.2308.02312 (2023).
  4. Turing, A. M. Mind 59, 433–460 (1950).
  5. Levesque, H. J., Davis, E. & Morgenstern, L. In: Proc. 13th Int. Conference on Principles of Knowledge Representation and Reasoning 552–561 (2012).
  6. Wang, A. et al. In: Proc. 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP 353–355 (2018).
  7. Wang, A. et al. In: Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 3266–3280 (2019).
  8. Srivastava, A. et al. Preprint at https://doi.org/10.48550/arXiv.2206.04615 (2022).
  9. Arkoudas, K. Preprint at https://doi.org/10.20944/preprints202308.0148.v2 (2023).

Author: Michael Eisenstein

Illustration: Saiman Chow

Design: Tanner Maxwell

Interactives: Chris Ryan

Picture editor: Ffion Cleverley

Subeditor: Jenny McCarthy

Project manager: Rebecca Jones

Editor: Richard Hodson

This article is part of Nature Outlook: Robotics and artificial intelligence, a supplement produced with financial support from FII Institute. Nature maintains full independence in all editorial decisions related to the content.

