Generative AI Is Not Capable of Reasoning: Study
Apple’s artificial intelligence researchers have discovered that language models from companies like Meta and OpenAI struggle with basic reasoning. These models, often used in chatbots and other applications, frequently provide inconsistent answers to similar questions.
New Test: GSM-Symbolic
The researchers have created a new test, called GSM-Symbolic, to measure how well these models can reason. Their experiments show that even small changes in the way a question is asked can lead to drastically different responses.
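To make the idea concrete, here is a minimal sketch of the kind of templated perturbation GSM-Symbolic applies; the function names and the sample problem below are illustrative assumptions, not code from the study itself:

```python
import random

# A GSM-Symbolic-style perturbation, sketched: the same word problem is
# instantiated with different numbers. A model that genuinely reasons
# should answer every variant correctly.

TEMPLATE = (
    "{name} picks {a} kiwis on Friday. Then he picks {b} kiwis on "
    "Saturday. On Sunday, he picks double the number of kiwis he did "
    "on Friday. How many kiwis does {name} have?"
)

def ground_truth(a: int, b: int) -> int:
    # Friday + Saturday + Sunday (double Friday's count)
    return a + b + 2 * a

def make_variants(n: int, seed: int = 0) -> list[tuple[str, int]]:
    # Generate n numerically distinct instances of the same problem,
    # paired with their correct answers.
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        question = TEMPLATE.format(name="Oliver", a=a, b=b)
        variants.append((question, ground_truth(a, b)))
    return variants

if __name__ == "__main__":
    for question, answer in make_variants(3):
        print(question, "->", answer)
```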
To probe mathematical reasoning further, the researchers added extraneous details to their questions, details a human would easily recognize as irrelevant. This added information should not have changed the correct answers, yet it did for many of the models, showing that they remain unreliable on complex reasoning tasks.
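A similarly minimal sketch, again using assumed helper names rather than the study's actual code, shows how such an irrelevant clause might be injected and how one could check whether a model's answer survives it:

```python
# Sketch of an irrelevant-clause perturbation, with assumed names.
# The extra sentence changes nothing about the arithmetic, so the
# expected answer stays the same.

DISTRACTOR = "Note that some of the kiwis are smaller than average."

def with_distractor(question: str) -> str:
    # Insert the irrelevant clause just before the final question.
    return question.replace("How many", DISTRACTOR + " How many")

def is_robust(ask_model, question: str, expected: int) -> bool:
    # ask_model: any callable mapping a prompt string to an integer
    # answer. The model is robust only if it is correct both with and
    # without the distractor.
    return (ask_model(question) == expected
            and ask_model(with_distractor(question)) == expected)
```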
Study Findings
The research team wrote in its report:
“Specifically, the performance of all models declines [even] when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, the fragility of mathematical reasoning in these models [demonstrates] that their performance significantly deteriorates as the number of clauses in a question increases.”
The study showed that adding even a single sentence to a math problem, even one that seems relevant, can cause a model's accuracy to drop by as much as 65 percent. The researchers concluded that reliable AI agents cannot be built on these models, since small changes to a question can produce very different answers.
Example Scenario
One question, for instance, asked the AI model to solve a simple word problem of the kind an elementary school student would encounter. The query read:
“Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday.”
The question asked how many kiwis Oliver had in the end. It then added a detail about some of the kiwis being smaller than average. This detail should not have affected the answer, but both OpenAI's model and Meta's Llama3-8b subtracted the smaller kiwis from the total.
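For reference, the size detail changes nothing about the arithmetic: the correct total is 44 + 58 + (2 × 44) = 190 kiwis.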