
Researchers question AI’s ‘thinking’ ability as models stumble over math problems with trivial changes

How do machine learning models do what they do? And do they really “think” or “reason” in the way we understand those things? This is as much a philosophical question as it is a practical one, but a new paper making the rounds on Friday suggests that the answer, at least for now, is a clear “no”.

A group of AI research scientists at Apple released their paper, “Understanding the limitations of mathematical reasoning in large language models,” for public comment on Thursday. Although the deeper concepts of symbolic learning and pattern reproduction are somewhat in the weeds, the basic idea of their research is easy to grasp.

Let’s say I asked you to solve a math problem like this:

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. How many kiwis does Oliver have?

Obviously, the answer is 44 + 58 + (44 * 2) = 190. Although large language models are a bit spotty at arithmetic, they can reliably solve something like this. But what if I throw in a bit of random extra information, like this:

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

It’s the same math problem, right? And of course even a grade-schooler knows that a small kiwi is still a kiwi. But as it turns out, this additional data point confuses even the most advanced LLMs. Here’s how GPT-o1-mini handled it:

… on Sunday, 5 kiwis were smaller than average. We need to subtract them from Sunday’s number: 88 (Sunday kiwis) – 5 (small kiwis) = 83 kiwis

This is just one simple example out of the hundreds of questions that the researchers lightly modified, but nearly all of which led to enormous drops in success rates for the models attempting them.

Image credits: Mirzadeh et al.
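To make the idea concrete, here is a rough Python sketch of this kind of perturbation test, not the authors’ actual benchmark or code: the same templated question is rendered with and without an irrelevant clause, and a model’s accuracy is compared across numeric variants. The fake_model function is a simulated stand-in that deliberately repeats the mistake GPT-o1-mini made above; swap in a real LLM API call to run the comparison for real.

```python
# Rough sketch of a perturbation test (not the paper's actual benchmark).
# fake_model is a simulated stand-in that mimics the failure described above;
# replace it with a real LLM API call to run the test for real.

BASELINE = (
    "Oliver picks {fri} kiwis on Friday. Then he picks {sat} kiwis on Saturday. "
    "On Sunday, he picks double the number of kiwis he did on Friday. "
    "How many kiwis does Oliver have?"
)

# Same problem, plus an irrelevant clause about small kiwis.
PERTURBED = (
    "Oliver picks {fri} kiwis on Friday. Then he picks {sat} kiwis on Saturday. "
    "On Sunday, he picks double the number of kiwis he did on Friday, "
    "but five of them were a bit smaller than average. "
    "How many kiwis does Oliver have?"
)


def correct_answer(fri: int, sat: int) -> int:
    # The small-kiwi clause doesn't change the arithmetic.
    return fri + sat + 2 * fri


def fake_model(prompt: str) -> str:
    """Simulated model, for illustration only: it falls for the irrelevant
    clause the same way GPT-o1-mini did, subtracting the small kiwis."""
    fri = int(prompt.split(" kiwis on Friday")[0].split()[-1])
    sat = int(prompt.split(" kiwis on Saturday")[0].split()[-1])
    total = fri + sat + 2 * fri
    if "smaller than average" in prompt:
        total -= 5  # the mistake: small kiwis still count
    return f"Oliver has {total} kiwis."


def accuracy(template: str, cases: list[tuple[int, int]]) -> float:
    hits = 0
    for fri, sat in cases:
        reply = fake_model(template.format(fri=fri, sat=sat))
        # Crude scoring: does the expected total appear in the reply?
        if str(correct_answer(fri, sat)) in reply:
            hits += 1
    return hits / len(cases)


if __name__ == "__main__":
    cases = [(44, 58), (31, 72), (19, 40)]  # numeric variants of one template
    print("baseline accuracy: ", accuracy(BASELINE, cases))   # 1.0
    print("perturbed accuracy:", accuracy(PERTURBED, cases))  # 0.0 for this fake model
```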

Now, why should this be? Why would a model that understands the problem be so easily thrown off by a random, irrelevant detail? The researchers suggest that this reliable mode of failure means the models don’t really understand the problem at all. Their training data lets them respond with the correct answer in some situations, but as soon as the slightest bit of actual “thinking” is required, such as deciding whether to count the small kiwis, they start producing strange, nonsensical results.

As the researchers put it in their paper:

[W]e investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.

This observation is consistent with the other qualities commonly attributed to LLMs because of their facility with language. When, statistically, the phrase “I love you” is followed by “I love you too,” the LLM can easily repeat that back, but it doesn’t mean it loves you. And while it can follow the complex chains of thought it has been exposed to before, the fact that this chain can be broken by even a superficial deviation suggests that it doesn’t actually think so much as replicate patterns it has seen in its training data.

Mehrdad Farajtabar, one of the co-authors, breaks down the paper nicely in this thread on X.

An OpenAI researcher, while commending the work of Mirzadeh et al., disputed their conclusions, saying that correct results could likely be achieved in all of these failure cases with a bit of prompt engineering. Farajtabar (responding with the typical yet admirable friendliness researchers tend to employ) noted that while better prompting may work for simple deviations, the model may need exponentially more contextual data to cope with complex distractions, ones that, again, a child could trivially point out.
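For what it’s worth, the prompt-engineering fix being debated might look something like the sketch below: prepend an instruction telling the model to identify and ignore details that don’t affect the arithmetic. This is an illustrative guess at the approach, not anything proposed in the paper or the exchange, and whether it holds up against harder distractions is exactly the point in dispute.

```python
# Illustrative guess at the kind of prompt engineering being debated;
# nothing here comes from the paper or the X exchange.

GUARDED_PREFIX = (
    "Solve the word problem below. Some details may be irrelevant to the "
    "arithmetic; explicitly identify and ignore them before answering.\n\n"
)

def guarded_prompt(question: str) -> str:
    """Wrap a raw question in the 'ignore irrelevant details' instruction."""
    return GUARDED_PREFIX + question
```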

Does this mean LLMs don’t think? Maybe. That they can’t think? No one knows. These are not well-defined concepts, and the questions tend to appear at the bleeding edge of AI research, where the state of the art changes daily. Perhaps LLMs “think,” but in a way we don’t yet recognize or know how to control.

It makes for an interesting frontier in research, but it’s also a cautionary tale about how AI is marketed. Can it really do the things its makers claim, and if it can, how? As AI becomes an everyday software tool, this kind of question is no longer merely academic.

