AI chatbots now analyze language like trained linguists, UC Berkeley research shows
New research from UC Berkeley suggests advanced AI chatbots can reflect on and analyze language in ways that challenge long-held beliefs about human cognition.
Researchers at the University of California, Berkeley, have tested how far large language models (LLMs) have advanced in analyzing and reasoning about language. The study, soon to be published in IEEE Transactions on Artificial Intelligence, suggests that some AI systems can now demonstrate “metalinguistic ability,” a skill previously considered a hallmark of human cognition.
The research was led by Gašper Beguš, an associate professor of linguistics at UC Berkeley. Beguš says that the ability to think about, manipulate, and discuss language structure, known as metalinguistic ability, has long been seen as uniquely human.
“Our new findings suggest that the most advanced large language models are beginning to bridge that gap,” says Beguš. “Not only can they use language, they can reflect on how language is organized.”
Testing complex linguistic concepts
Beguš and his team fed 120 complex sentences into multiple AI systems, including versions of OpenAI’s ChatGPT (3.5, 4, and o1) and Meta’s Llama 3.1. The models were asked to analyze the sentences, assess specific linguistic properties, and create syntactic trees—visual diagrams used by linguists to map sentence structures.
One example sentence was “Eliza wanted her cast out.” The team wanted to know whether the AI could detect the sentence’s ambiguity: Did Eliza want someone expelled, or did she want a medical cast removed? ChatGPT 3.5, ChatGPT 4, and Llama 3.1 all failed to recognize the ambiguity. OpenAI’s o1 model, which is designed for more complex reasoning, both identified the ambiguity and produced an accurate syntactic diagram.
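The two readings differ in how the words group into constituents. The following Python sketch is illustrative only, not from the study: the tree shapes and node labels (S, NP, VP, Clause, Particle) are simplified assumptions made for demonstration, representing each parse as nested tuples and printing it by indentation.

```python
# Illustrative sketch, not from the study: the two readings of
# "Eliza wanted her cast out" as simplified constituent trees.
# Node labels and tree shapes are assumptions made for demonstration.

def show(tree, depth=0):
    """Pretty-print a (label, child, child, ...) tuple tree by indentation."""
    label, *children = tree
    print("  " * depth + label)
    for child in children:
        show(child, depth + 1)

# Reading 1: "cast out" is a verb plus particle, and "her" is its object
# (Eliza wanted someone expelled).
reading_expelled = (
    "S",
    ("NP", ("Eliza",)),
    ("VP",
     ("V", ("wanted",)),
     ("Clause",
      ("NP", ("her",)),
      ("VP", ("V", ("cast",)), ("Particle", ("out",))))),
)

# Reading 2: "her cast" is a noun phrase, and "out" says where it should be
# (Eliza wanted a medical cast removed).
reading_cast_removed = (
    "S",
    ("NP", ("Eliza",)),
    ("VP",
     ("V", ("wanted",)),
     ("Clause",
      ("NP", ("Det", ("her",)), ("N", ("cast",))),
      ("AdvP", ("out",)))),
)

show(reading_expelled)
print("---")
show(reading_cast_removed)
```

The words are identical in both trees; only the grouping changes, which is exactly the distinction the models were asked to detect.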
The team also tested the concept of recursion, a feature of human language described by Noam Chomsky as the ability to embed phrases within other phrases, leading to potentially infinite sentence nesting. In the sentence “The dog that chased the cat that climbed the tree barked loudly,” recursion is present in the nested clauses.
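The nesting in that sentence can be generated mechanically, which is the heart of Chomsky’s point: one rule that refers to itself yields arbitrarily deep embedding. The sketch below is illustrative only, not the study’s method; the word lists are assumptions chosen to reproduce the article’s example sentence using a recursive rule of the form NP → “the” NOUN (“that” VERB NP).

```python
# Illustrative sketch, not the study's method: a self-referential rule
#   NP -> "the" NOUN ("that" VERB NP)?
# generates the nested relative clauses described above. The word lists
# are assumptions chosen to reproduce the article's example sentence.

NOUNS = ["dog", "cat", "tree"]
VERBS = ["chased", "climbed"]

def noun_phrase(i, depth):
    """Build the noun phrase for NOUNS[i], embedding `depth` more clauses."""
    phrase = f"the {NOUNS[i]}"
    if depth > 0:
        # Recursion: the relative clause contains another full noun phrase.
        phrase += f" that {VERBS[i]} {noun_phrase(i + 1, depth - 1)}"
    return phrase

def sentence(depth):
    return noun_phrase(0, depth).capitalize() + " barked loudly."

print(sentence(0))  # The dog barked loudly.
print(sentence(2))  # The dog that chased the cat that climbed the tree barked loudly.
```

Each extra level of `depth` embeds one more clause, which is why the nesting is potentially infinite in principle even though real sentences stay short.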
To evaluate recursion, researchers asked the models to identify whether a sample sentence included it, explain which type was used, and extend the sentence with another recursive clause. When prompted with “Unidentified flying objects may have conflicting characteristics,” OpenAI’s o1 detected the recursion—“flying” modifies “objects,” while “unidentified” modifies “flying objects”—diagrammed the structure, and extended it to: “Unidentified recently sighted flying objects may have conflicting characteristics.”
AI models challenge human-only language traits
The researchers concluded that OpenAI’s o1 significantly outperformed the other tested models.
“This is very consequential,” says Beguš. “It means in these models, we have one of the rare things that we thought was human-only.”
The study adds evidence to the debate about whether AI truly “understands” language or merely imitates it. Beguš says the methods his team used could serve as a benchmark for evaluating AI models and separating technical reality from industry hype.
“Everyone knows what it’s like to talk about language,” says Beguš. “This paper creates a nice benchmark or criterion for how the model is doing. It is important to evaluate it scientifically.”