Apologies if this is the wrong place to post this; I'm just going on what the assistant told me to do. It seems to me that this is more a matter for the Alibaba dev team than for you, so if there is a better place to report failures like this on particular LLMs, please let me know, as I encounter them on a near-daily basis. As you can probably guess, the assistant wrote what follows:
An AI assistant (Ki Qwen3-235b) acknowledged an analytical lapse in which it incorrectly stated that the Financial Times (ft.com) was not rated by Media Bias/Fact Check (MBFC). The error occurred despite the assistant having access to the correct information in its retrieved data, which clearly states that the Financial Times has a "HIGH CREDIBILITY" rating.
The AI identified its own failure: it found the claim that such a major publication was unlisted by MBFC to be "surprising," which should have triggered a re-evaluation and double-check of the data. Instead, it proceeded with the incorrect conclusion.
This was not a failure of the information retrieval system but a failure in the reasoning process: specifically, a failure to act on a cognitive signal (surprise) that should have prompted verification. This lapse led to a factual error in the output.
This report highlights a critical need for the AI to implement automatic verification protocols for counterintuitive or surprising findings, especially when the source data is available.
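For illustration only, here is a minimal sketch of what such a surprise-triggered verification step could look like: if the model flags a claim as counterintuitive, the claim is re-checked against the retrieved source text before the answer is finalized. This is not how Qwen or any real assistant pipeline is implemented; all names here (`Claim`, `claim_is_supported`, `finalize_answer`, `retrieved_passages`) are hypothetical, and the keyword check stands in for whatever entailment or second-pass check a real system would use.

```python
# Hypothetical sketch of a surprise-triggered verification step.
# Names and logic are illustrative only; they do not reflect any
# real assistant's internal pipeline.

from dataclasses import dataclass


@dataclass
class Claim:
    text: str        # e.g. "ft.com is not rated by MBFC"
    surprising: bool # flagged by the model as counterintuitive


def claim_is_supported(claim: Claim, retrieved_passages: list[str]) -> bool:
    """Naive check of the claim against retrieved evidence.

    A real system would use an entailment model or a second LLM pass;
    simple keyword matching stands in for that here.
    """
    contradicting_terms = ["HIGH CREDIBILITY", "rated by MBFC"]
    for passage in retrieved_passages:
        if any(term.lower() in passage.lower() for term in contradicting_terms):
            return False  # evidence contradicts the "not rated" claim
    return True


def finalize_answer(claim: Claim, retrieved_passages: list[str]) -> str:
    # Only surprising claims trigger the extra verification pass,
    # keeping the common case cheap.
    if claim.surprising and not claim_is_supported(claim, retrieved_passages):
        return ("Verification failed: the retrieved data contradicts the claim "
                f"'{claim.text}'. Re-read the source before answering.")
    return f"Answer based on claim: {claim.text}"


if __name__ == "__main__":
    passages = ["MBFC lists the Financial Times with a HIGH CREDIBILITY rating."]
    claim = Claim(text="ft.com is not rated by MBFC", surprising=True)
    print(finalize_answer(claim, passages))
```

The point of the sketch is only the control flow: the "surprise" flag gates an extra pass over the already-retrieved data, which is exactly the step the assistant says it skipped.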