Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be

For a while now, companies like OpenAI and Google have been touting advanced “reasoning” capabilities as the next big step in their latest artificial intelligence models. Now, however, a new study from six Apple engineers shows that the mathematical “reasoning” displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.

The fragility highlighted in these new results helps support previous research suggesting that LLMs’ use of probabilistic pattern matching is missing the formal understanding of underlying concepts needed for truly reliable mathematical reasoning capabilities. “Current LLMs are not capable of genuine logical reasoning,” the researchers hypothesize based on these results. “Instead, they attempt to replicate the reasoning steps observed in their training data.”

Mix it up

In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models” (currently available as a preprint paper), six Apple researchers started with GSM8K’s standardized set of more than 8,000 grade-school-level math word problems, which is commonly used as a benchmark for the complex reasoning capabilities of modern LLMs. They then took the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values, so that a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.

This approach helps avoid any “data contamination” that could result from the static GSM8K questions being fed directly into an AI model’s training data. At the same time, these incidental changes don’t alter the actual difficulty of the underlying mathematical reasoning, which means the models should, in theory, perform just as well when tested on GSM-Symbolic as on GSM8K.
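The templating idea the researchers describe can be sketched in a few lines of Python. This is purely illustrative: the template wording, names, and number ranges below are hypothetical stand-ins, not material from the paper.

```python
import random

# Hypothetical GSM-Symbolic-style templating: the names and numbers in a
# word problem become placeholders, so each draw yields a fresh surface
# variant whose underlying reasoning steps are unchanged.
TEMPLATE = ("{name} buys {n} building blocks for their {relative}. "
            "Each block costs ${cost}. How much does {name} spend?")

NAMES = ["Sophie", "Bill", "Ana", "Ravi"]
RELATIVES = ["nephew", "brother", "niece", "cousin"]

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate the template and return (question, ground-truth answer)."""
    n = rng.randint(5, 40)
    cost = rng.randint(1, 9)
    question = TEMPLATE.format(
        name=rng.choice(NAMES),
        n=n,
        relative=rng.choice(RELATIVES),
        cost=cost,
    )
    # The correct answer stays trivially computable for every variant.
    return question, n * cost

rng = random.Random(0)
question, answer = make_variant(rng)
```

Because the ground truth is generated alongside each variant, a model’s answers can be graded automatically across arbitrarily many freshly drawn versions of the same problem.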

Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops of between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names.
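The best-versus-worst-run gap is a simple statistic to compute. A trivial sketch, using made-up per-run accuracy numbers rather than figures from the paper:

```python
# Illustrative only: given per-run accuracies (percent) from repeated
# GSM-Symbolic draws, report the spread the researchers flagged.
def accuracy_spread(run_accuracies: list[float]) -> float:
    """Gap in percentage points between the best and worst runs."""
    return max(run_accuracies) - min(run_accuracies)

# Hypothetical numbers for five runs of one model:
runs = [88.0, 79.5, 91.2, 76.4, 84.3]
print(round(accuracy_spread(runs), 1))  # 14.8
```

A spread this large on what is nominally the same set of problems is the variance the researchers found surprising.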

This kind of variance, both within different GSM-Symbolic runs and compared against GSM8K results, is more than a little surprising since, as the researchers point out, “the overall reasoning steps needed to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any “formal” reasoning but are instead “attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”

Don’t get distracted

Still, the overall variance shown by the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI’s ChatGPT-4o, for example, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That’s a pretty high success rate using either benchmark, regardless of whether or not the model itself is using “formal” reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).

The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this “GSM-NoOp” (short for “no operation”) benchmark set, a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.”
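A GSM-NoOp-style perturbation amounts to splicing an irrelevant clause into an otherwise solvable problem. The sketch below is a paraphrase of the kiwi example; the helper function and exact wording are hypothetical, not the paper’s tooling.

```python
# Sketch of a GSM-NoOp-style perturbation: append a statement that sounds
# relevant but changes nothing about the arithmetic.
BASE = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Oliver have?")

DISTRACTOR = "Five of the kiwis were a bit smaller than average."

def add_no_op(question: str, distractor: str) -> str:
    """Splice an irrelevant statement in just before the final question."""
    body, final_question = question.rsplit(". ", 1)
    return f"{body}. {distractor} {final_question}"

perturbed = add_no_op(BASE, DISTRACTOR)
# The correct answer (44 + 58 = 102) is untouched by the extra clause;
# a model that subtracts the five smaller kiwis has been fooled.
```

Because the distractor is a no-op, any change in a model’s answer on the perturbed question is attributable to the red herring alone.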

Adding in these red herrings led to what the researchers called a “catastrophic performance drop” in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits in using simple “pattern matching” to “convert statements to operations without truly understanding their meaning,” the researchers wrote.

