
AI models fail on complex reasoning tasks

2025-10-31 journalism

Amsterdam, Friday, 31 October 2025.
Recent research shows that large language models and reasoning models fail catastrophically on complex reasoning tasks, despite performing well on simpler problems. These findings have important implications for the use of AI in journalism, science and other fields where deep reasoning is essential.

AI models fail on complex reasoning tasks — what new research shows

New research concludes that large language models (LLMs) and specialised reasoning models (LRMs) perform well on simple reasoning tasks but fail abruptly and catastrophically once problem complexity rises above a modest threshold [1]. The study introduces a scalable, synthetically generated test corpus (DeepRD) and demonstrates that LRM performance drops sharply on graph-connectivity and natural-language proof-planning tasks as complexity increases [1]. The findings, published last Saturday, constitute an important warning about claims of general reasoning capacity in current LLM systems [1].
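
Because the corpus is generated rather than hand-written, its difficulty can be scaled arbitrarily. The Python sketch below is a hypothetical illustration rather than the actual DeepRD generator; it shows how a graph-connectivity item with a tunable complexity level and a verifiable ground-truth answer could be produced:

```python
import random
from collections import deque

def make_connectivity_item(num_nodes: int, num_edges: int, seed: int = 0):
    """Build a toy graph-connectivity question whose difficulty scales
    with graph size. Purely illustrative; not the DeepRD generator."""
    rng = random.Random(seed)
    nodes = list(range(num_nodes))
    edges = set()
    while len(edges) < num_edges:
        a, b = rng.sample(nodes, 2)
        edges.add((min(a, b), max(a, b)))
    src, dst = rng.sample(nodes, 2)
    question = (f"Nodes: {nodes}. Edges: {sorted(edges)}. "
                f"Is there a path from node {src} to node {dst}?")
    # Ground-truth answer via breadth-first search.
    adjacency = {n: [] for n in nodes}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen, queue = {src}, deque([src])
    while queue:
        for nxt in adjacency[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return question, dst in seen

# Larger graphs require longer reasoning chains for the same question type.
question, answer = make_connectivity_item(num_nodes=12, num_edges=10)
print(question, "->", answer)
```

Increasing the node and edge counts lengthens the reasoning chain required, which is the dimension along which the study reports the sharp performance drop.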

How the technology behind journalistic AI applications works

Journalistic AI applications often build on LLMs or derivatives such as LRMs, combined with retrieval-augmented generation (RAG) and multimodal pipelines that link documents, images and metadata for fact-checking, summarisation and production automation [2][1]. In practice, models are fine-tuned and sometimes given chain-of-thought or self-verification incentives to encourage stepwise argumentation, an approach that should in theory improve the transparency of reasoning but, in controlled tests, is not guaranteed to generalise to harder reasoning patterns [1][2].
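
To make the RAG pattern concrete, the sketch below assembles a grounded, step-by-step prompt from a small in-memory archive. The archive contents, the keyword-overlap ranking and the fact that no model is actually called are assumptions for illustration; production newsroom systems typically use vector search over a document index and a hosted LLM:

```python
from collections import Counter

# Toy archive standing in for a newsroom document index (illustrative only).
ARCHIVE = {
    "doc-001": "Council approved the harbour budget in March 2024.",
    "doc-002": "The harbour budget rose 12 percent compared with 2023.",
    "doc-003": "A new tram line opened in the city centre last year.",
}

def retrieve(query: str, k: int = 2):
    """Rank archive snippets by crude keyword overlap with the query."""
    q_terms = Counter(query.lower().split())
    scored = []
    for doc_id, text in ARCHIVE.items():
        overlap = sum((Counter(text.lower().split()) & q_terms).values())
        scored.append((overlap, doc_id, text))
    scored.sort(reverse=True)
    return [(doc_id, text) for _, doc_id, text in scored[:k]]

def build_prompt(query: str) -> str:
    """Assemble a grounded prompt; the model call itself is left out."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieve(query))
    return (f"Answer using only the sources below and cite them.\n"
            f"{context}\n\nQuestion: {query}\nStep-by-step answer:")

print(build_prompt("How much did the harbour budget increase?"))
```

The design choice worth noting is that the retrieved context constrains the model to cited material; the study's caveat is that the reasoning step over that context still fails once it becomes sufficiently complex [1].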

Concrete use in the newsroom: from fact-checking to investigative journalism

In newsrooms, AI systems are used for rapid document extraction, automatic transcription and first drafts of background articles, and as fact-checking assistants that search for and rank relevant sources via RAG-like systems [2]. For day-to-day news production, many tasks fall within the complexity range that LRMs currently handle well, but long-term investigative projects that combine multiple, deeply interlinked information sources (multi-hop reasoning) are at risk once that reasoning reaches the scale and complexity at which LRMs fail [1][3][2].
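
One way to operationalise that distinction is to route claims by the number of reasoning hops they depend on. The sketch below is a hypothetical editorial triage policy, not a published newsroom tool; the Claim structure and the two-hop threshold are assumptions chosen for illustration:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    hops: int              # how many linked sources the claim depends on
    model_confident: bool  # the model's own (unreliable) confidence signal

# Hypothetical policy threshold: beyond this, do not trust automated reasoning.
MAX_TRUSTED_HOPS = 2

def triage(claim: Claim) -> str:
    """Route claims: shallow ones can be spot-checked, deep multi-hop claims
    go to full human verification regardless of how confident the model sounds."""
    if claim.hops <= MAX_TRUSTED_HOPS and claim.model_confident:
        return "auto-assist: editor spot-checks the cited sources"
    return "manual: full human verification required"

for claim in [
    Claim("Budget rose 12% (two linked documents)", hops=2, model_confident=True),
    Claim("Shell-company chain links five filings", hops=5, model_confident=True),
]:
    print(claim.text, "->", triage(claim))
```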

Benefits for news production and consumption

AI can speed up news production by automating routine tasks (transcription, summarisation, metadata extraction) and free scarce editorial hours for in-depth analysis; some systems also demonstrate significant efficiency gains in large-scale document conversions and batch processing of PDFs and images [2]. Moreover, multimodal and retrieval-based systems increase the scale at which newsrooms can search and combine sources, which can speed up news delivery and enable data-driven publications [2].
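
As an example of the routine automation described above, the sketch below walks a folder of plain-text transcripts and attaches a crude extractive summary to each file. The transcripts/ folder name and the first-sentences heuristic are assumptions; a real pipeline would call a transcription or summarisation model at that point:

```python
from pathlib import Path

def first_sentences(text: str, n: int = 2) -> str:
    """Very rough extractive 'summary': just the first n sentences."""
    parts = text.replace("\n", " ").split(". ")
    return ". ".join(parts[:n]).strip()

def batch_summarise(folder: str) -> dict:
    """Walk a folder of plain-text transcripts and attach a stub summary
    to each file name. Illustrative only; a real pipeline would call an
    LLM or summarisation service here."""
    summaries = {}
    for path in Path(folder).glob("*.txt"):
        summaries[path.name] = first_sentences(path.read_text(encoding="utf-8"))
    return summaries

# Example usage (assumes a local transcripts/ directory of .txt files):
# for name, summary in batch_summarise("transcripts/").items():
#     print(name, "->", summary)
```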

Risks and limitations — where it can go wrong

The abrupt failure of LLMs/LRMs as reasoning complexity increases means investigative stories that require multi-hop evidence or long chains of causal inference are vulnerable to invisible errors and misleadingly confident model statements [1][3]. Automatic summaries or legal/medical analysis assistants that operate beyond the training complexity range can present flawed reasoning as plausible conclusions — a specific risk the arXiv study highlights as the ‘long tail’ of real-world cases outside the success regime [1].

Ethical considerations and accountability in editorial use

Ethics in journalistic use requires transparency about when and how AI was employed, clear ultimate human responsibility, and systematic verification of model outputs, especially when reasoning and evidence are crucial to a story's reliability [3][1]. In addition, the limited generalisation capability of models raises questions about publication policies, correction mechanisms and how uncertainty is communicated to readers [3][1].

Practical recommendations for newsrooms

Newsrooms deploying AI should (a) classify tasks by reasoning complexity and limit AI use to tasks within proven success domains, (b) employ RAG workflows and external retrieval with human verification for multi-hop claims, and (c) run stress tests on harder reasoning patterns before publishing AI outputs unchanged [2][1][3]. Where there is uncertainty about model capability in a specific case, for instance because generalisation beyond the tested complexity range is not guaranteed [1], that uncertainty should be explicitly reported to the public [1][3].
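
Point (c) can be approached as a simple measurement loop: evaluate the model on tasks of increasing complexity and record where accuracy falls below an editorial acceptance bar. The sketch below simulates that loop with a stubbed model whose accuracy figures (0.95 up to a complexity of 6, 0.30 beyond it) and acceptance bar (0.9) are invented for illustration; in practice the stub would be replaced by real model calls on generated test items such as the connectivity tasks above:

```python
import random

def model_answer(chain_length: int) -> bool:
    """Stand-in for querying a reasoning model on a task of the given
    complexity. Accuracy is simulated to drop sharply beyond a threshold,
    mirroring the pattern the study reports; replace with real evaluations."""
    return random.random() < (0.95 if chain_length <= 6 else 0.30)

def stress_test(max_length: int = 12, trials: int = 200):
    """Estimate accuracy per complexity level and report where it falls
    below an editorial acceptance bar."""
    acceptance_bar = 0.9
    report = {}
    for length in range(1, max_length + 1):
        correct = sum(model_answer(length) for _ in range(trials))
        report[length] = correct / trials
    safe_levels = [level for level, acc in report.items() if acc >= acceptance_bar]
    return report, (max(safe_levels) if safe_levels else 0)

report, safe_up_to = stress_test()
print("Accuracy by complexity level:", report)
print("Safe to automate up to complexity:", safe_up_to)
```

The resulting "safe to automate up to" level can then feed directly into the task classification in point (a).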

Sources

All research findings and technical claims in this article are based on recent papers, surveys and publication platforms from the AI literature [1][2][3].
