February 21, 2024, in technology

By David Baker

Is Generative AI a Dead End?

While some claim that generative AI models like ChatGPT are a big step along the road to human levels of general intelligence, others see the technology as on the wrong path entirely.


Just over a year ago, ChatGPT became the first generative AI model to be made widely available for public use, unleashing a storm of enthusiasm and alarm.

A useful tool …

Since then, the models’ impressive language abilities have begun to be harnessed in many applications from internet search to customer support.

In areas like the news-media and financial services sectors, where creation and delivery of quality content is critical, generative AI is augmenting - but not replacing - human authors and editors with services like automatic semantic tagging, summary and title generation, as well as sophisticated language and style control.

… but is it the real thing?

But, as the practical applications of generative AI continue to evolve, a theoretical debate has begun about what kind of beast generative AI is and what it might be leading to.

On one side of this discussion are those (often the developers themselves) who hold that the current crop of AI models is a significant step forward towards the Holy Grail of AI development, usually described as ‘Artificial General Intelligence’ - AGI.

Early last year a Microsoft research team published a paper, entitled Sparks of Artificial General Intelligence, containing the claim:

“Given the breadth and depth of GPT-4’s capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.” 

More recently, in an article with the bold title Artificial General Intelligence Is Already Here, a senior Google AI researcher wrote:

“Today’s most advanced AI models have many flaws, but decades from now, they will be recognized as the first true examples of artificial general intelligence.” 

Just an off-ramp?

Other AI practitioners (for the most part those not involved in developing generative models) have greeted these claims with varying degrees of scepticism. Perhaps the most succinct expression of this scepticism is a tweet by Yann LeCun, chief AI scientist at Meta:

“On the highway towards Human-Level AI, Large Language Model is an off-ramp.” 

LeCun's point here is not that current AI language models are insufficiently powerful or refined to reach human levels of general intelligence. He is saying that the approach can never, even in principle, lead to the kind of intelligence required.

An important question

The dispute may seem an academic question with little practical importance but, as the EU prepares legislation to govern the use of AI technology, the potential and future direction of generative AI has a direct bearing on how it should be regulated.

It also lies behind the ethical debate that recently caused turmoil at OpenAI, currently leading the drive to commercialize generative AI technology. In fact, OpenAI’s founding charter explicitly names the realization of general AI as part of the company’s mission:

OpenAI’s mission is to ensure that artificial general intelligence (AGI)—by which we mean highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity. 

And, for everyone whose life and work will be impacted by this technology, it is important to have an idea of where it may be going and what it may be capable of.


So what is Artificial General Intelligence?

Before going on to look at the merits of the question, it may be useful to have an idea of what is meant by AGI.

The OpenAI charter, quoted above, defines it as:

“… highly autonomous systems that outperform humans at most economically valuable work.” 

Another widely accepted definition is:

‘A computer system that can equal or surpass humans across the full spectrum of intellectual activity’. 

While these definitions are necessarily imprecise, the key property they pick out is generality.

Putting the G in AGI

This is because, for most of the history of AI, the systems developed were necessarily specialized: they were programmed or trained to work in a restricted domain – chess computers, language translators, industrial robots etc. There was no system that could be trained easily to work in multiple domains as a human subject could. This generalized ability was therefore seen as a defining feature of a truly intelligent system, approaching or surpassing human capabilities.

When the latest generation of AI models came along, one of their most remarkable features from an AI perspective was their versatility. Following an initial training, large language models (LLMs) like ChatGPT could handle a seemingly unlimited range of language tasks. Without specific training, they have also performed well even on tough specialized tasks like the American Bar exam and SAT tests.

In this sense, the transformer technology behind these models has certainly put the G in AGI. It’s therefore understandable that this generalizable intelligence has been taken as a sign that it’s only a question of time before generative AI delivers the full range of AGI capabilities.

But those unwilling to concede that generative AI is going to lead to AGI point out that, for all their impressive language feats, generative AI models lack a number of essential features that have traditionally been considered fundamental to AI systems.

Language is not the world

The first of these is what may be called ‘knowledge of the world.’

In an earlier post - Masters of mimicry and hearsay: how ChatGPT really works - we looked at the rather simple mechanisms behind the training of large language models (LLMs). We saw that the apparent ‘knowledge’ of the world possessed by these models is based on the statistical distribution of words in what others have said or written. The models have no independent source of information against which they can check the content of their training data.

In an important sense, then, a generative AI model trained in this way knows nothing about the world. It has statistical data about the co-occurrence of elements in what other people have said or written and this often allows it to make a correct connection – but this is not the same as having knowledge of the world.

This deficiency lies behind the well-known tendency of LLMs to ‘hallucinate’ – i.e. to produce apparently factual content that is wholly or partially false.

Given the massive amount of language that is used to train the model, it’s statistically likely that, for many prompts and tasks, enough true correlations exist to allow it to respond correctly. But these are just statistical correlations depending on the training data – not ‘hardwired’ propositions about the real world.
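To make this concrete, here is a deliberately tiny sketch in Python of the kind of next-word statistics an LLM is built on. The four-sentence ‘corpus’ is invented for illustration; real models use neural networks trained on billions of documents, but the underlying principle of predicting the most likely continuation from co-occurrence counts is the same:

```python
from collections import Counter, defaultdict

# A toy stand-in for the model's training data.
corpus = (
    "paris is the capital of france . "
    "rome is the capital of italy . "
    "paris is the capital of france . "
    "sydney is the capital of australia . "
).split()

# Count which word follows each two-word context (a toy trigram model).
counts = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    counts[(a, b)][c] += 1

def predict(a, b):
    """Return the statistically most likely next word after the context (a, b)."""
    return counts[(a, b)].most_common(1)[0][0]

print(predict("paris", "is"))    # 'the'
print(predict("capital", "of"))  # 'france' (the most frequent continuation)
```

Note that, asked to complete ‘Sydney is the capital of …’, this toy model still answers ‘france’: only the two preceding words and their statistics matter, not any fact about the world, or even the subject of the sentence.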

Mixed-up maps

One way of probing a language model’s knowledge of the world is to ask it to produce a map – say of cities in North America. This typically results in some very peculiar geographies, with cities located in the middle of oceans, or widely distant from their real locations. The reason for this is that, although there may be many references to the location of cities in the training corpus, they are not sufficiently numerous or precise to allow an exact mapping of their positions.

Cheating at chess

A similar problem arises when an LLM plays chess. Because it has digested many examples of chess games, an LLM can play at a reasonable level. But, unlike a dedicated chess program, every now and then it makes a completely illegal move, showing that its proficiency, based as it is on a probabilistic analysis of games in the training data, does not include a complete knowledge of the rules.
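The contrast with a rule-based engine can be sketched in a few lines. The following toy Python function (invented for illustration, and covering only a lone knight on an otherwise empty board) hard-codes the rule itself, so an illegal move can never be accepted, whatever the statistics of past games:

```python
# A dedicated engine encodes the rules; legality is checked, not inferred
# from examples. Toy illustration for a single knight on an empty board.

FILES = "abcdefgh"

def is_legal_knight_move(src: str, dst: str) -> bool:
    """True iff src -> dst (e.g. 'g1' -> 'f3') is a legal knight move."""
    dx = abs(FILES.index(src[0]) - FILES.index(dst[0]))
    dy = abs(int(src[1]) - int(dst[1]))
    return sorted((dx, dy)) == [1, 2]  # the L-shape, in either orientation

print(is_legal_knight_move("g1", "f3"))  # True: a standard opening move
print(is_legal_knight_move("g1", "g3"))  # False: knights never move straight
```

However many games a statistical model has digested, nothing in its training enforces a constraint like this; the rule exists in its output only to the extent that legal moves dominate the training data.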

Language is not logic

A similar consideration applies to LLMs’ poor capacity to reason logically or mathematically.

Computers have traditionally been able to perform logical deductions and mathematical calculations faster and more accurately than humans, so we might expect the performance of LLMs on maths or logic problems to be particularly good. In reality they are fairly mediocre, often making elementary errors in arithmetic or deduction.

This is because their ‘deduction’ or ‘calculation’ is based on the statistical distribution of relations between terms in the training data. This leads them to choose the most likely answer, but not necessarily the correct one, because the training data may be incorrect or incomplete.

The following example comes from a recent paper, Talking about Large Language Models, by Murray Shanahan, Professor of Cognitive Robotics at Imperial College London.

“If we prompt an LLM with ‘All humans are mortal and Socrates is human, therefore’, we are not instructing it to carry out deductive inference.

Rather, we are asking it the following question: given the statistical distribution of words in the public corpus, what words are likely to follow the sequence ‘All humans are mortal and Socrates is human, therefore’?

A good answer to this would be ‘Socrates is mortal’.”


Again, the model is giving us a probable answer based on the distribution of words in what others have said or written. It is not computing the answer in the way a hard-coded theorem-proving algorithm would.
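A theorem-proving algorithm, by contrast, computes the conclusion. The minimal forward-chaining sketch below (illustrative Python, not any particular prover) derives ‘Socrates is mortal’ from the rule and the fact, and no amount of contrary text could change its answer:

```python
# Facts are (predicate, subject) pairs; a rule (p, q) reads:
# for all x, p(x) implies q(x).
facts = {("human", "socrates")}
rules = [("human", "mortal")]

def forward_chain(facts, rules):
    """Apply every rule to every matching fact until nothing new is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for pred, subject in list(derived):
                if pred == premise and (conclusion, subject) not in derived:
                    derived.add((conclusion, subject))
                    changed = True
    return derived

print(("mortal", "socrates") in forward_chain(facts, rules))  # True, by deduction
```

Here the conclusion follows with certainty from the premises; there is no probability involved and no training corpus to be wrong about it.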

In fact, in a recent study that tested LLMs such as ChatGPT and Bard with batteries of tests designed to probe different types of reasoning, none of the models achieved a perfect score on any test, and most scored below average human performance.

The authors also presented evidence that performance of models on such tests improves with their size – a further indication that correct responses are the product of statistical – i.e. ‘chance’ – processes.


Mediocre mathematics

LLMs have turned out to be no better at maths than they are at logical reasoning, often producing elementary errors in simple arithmetical tasks. Recognizing this limitation, researchers at Microsoft developed a prompt generator designed to improve models’ handling of maths problems. While it certainly improved on the models’ initially poor performance, the best score obtained by the most successful prompt generator was 92.5%.
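Another mitigation, distinct from prompt engineering, is not to trust the model’s arithmetic at all and instead route calculations to an exact evaluator, the idea behind ‘tool use’. A minimal sketch using only Python’s standard library (the function name and the set of supported operators are our own choices for illustration):

```python
import ast
import operator

# Exact evaluation of an arithmetic expression, e.g. one extracted from a
# model's answer, so the result is computed rather than guessed.
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def evaluate(expr: str) -> int:
    """Safely evaluate +, -, *, //, ** over integers; reject anything else."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

print(evaluate("12345 * 6789"))  # 83810205 - exact, every time
```

Unlike a statistically generated answer, a result produced this way carries no residual uncertainty, which is precisely the property critics say a purely generative model can never have.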

These examples all result from what critics of generative AI models regard as a fundamental weakness in their way of dealing with rigorous, rule-based disciplines: being based on statistical constructs rather than ‘hard-wired’ elements and rules, the models will always incorporate a degree of uncertainty in their output that no amount of training or ‘prompt engineering’ can eliminate.

In a recent article, Yann LeCun puts it with his usual force:

“A system trained on language alone will never approximate human intelligence, even if trained from now until the heat death of the universe.” 

The missing pieces

These results indicate that current generative AI models, in spite of their impressive linguistic performance, flexibility and even apparent creativity, are lacking something which many would regard as essential to ‘real’ human-level intelligence.

What exactly are these missing elements? Do other types of AI have them? Could they be added to generative AI to bring it closer to AGI?

In the next post we will attempt to answer these questions by looking more closely at the technologies underlying today’s generative AI and its ‘traditional’ predecessors.

Meanwhile – generative AI does useful work

While the theoretical debate on the exact nature and potential of generative AI continues, these models are significantly improving productivity in sectors from news media to financial services.

By taking care of complex or routine tasks that take up the time of authors and editors, the models free up human resources to concentrate on adding value and building customer engagement.

Find out how generative AI technology is enhancing productivity in the news-media sector.



Find out more about Eidosmedia products and technology.