How could AI get a real world model?
The weaknesses of current generative AI models are often attributed to their lack of independent knowledge of the world. Where would this knowledge come from, and how could an AI model acquire it? We look at the latest developments in 'physical AI'.
In the previous post in this series we looked at the limitations of today's large language models (LLMs). We noted that, however large the corpus of language data used to train them, their lack of independent knowledge about the world would always lead to a fundamental unreliability manifested in the well-known phenomenon of 'hallucinations'.
Chessboards and floorplans
For critics of LLMs, the solution to this problem is to equip AI engines with a 'world model' - a technique that belongs to a more traditional approach to AI.
Examples of this kind of AI are the models that drive chess computers or robot vacuum cleaners like the Roomba. These systems consist of a simulated model of the world and rules for manipulating the elements of the model. In the case of the chess computer the ‘world’ consists of the chess board and the rules of the game. The Roomba’s world consists of some simple rules for movement and the map it builds up of the rooms during its first attempts to navigate the house. These simple frameworks allow the AI to plan actions and predict results.
Feedback and learning
Although they are very simple examples compared with the kind of world model that a human-level AI would need, they illustrate some key features. The Roomba’s world model is continuously updated by feedback from its environment – especially at the beginning of its deployment. The chess computer must continuously update its map of the pieces on the board and modify its predictions as the game proceeds.
They differ from general intelligence models mainly in having smaller domains of activity and predefined objectives (Checkmate, Clean house).
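To make this concrete, here is a minimal sketch in Python of a Roomba-style world model: an internal map that is corrected by feedback from the environment, plus simple rules for predicting the result of a move before making it. The names and the grid representation are invented for illustration; no real robot works exactly like this.

```python
# A toy 'Roomba-style' world model: a grid map built up from feedback,
# plus simple movement rules used to predict the outcome of an action.
# All names are illustrative; this is not any real robot's code.

MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

class TinyWorldModel:
    def __init__(self):
        self.obstacles = set()      # cells the robot has bumped into
        self.visited = set()        # cells it knows are free

    def predict(self, position, action):
        """Predict where an action leads, using the current map."""
        dx, dy = MOVES[action]
        target = (position[0] + dx, position[1] + dy)
        if target in self.obstacles:
            return position          # the model predicts the move is blocked
        return target

    def update(self, position, action, observed_position):
        """Feedback: correct the map when a prediction turns out wrong."""
        predicted = self.predict(position, action)
        if observed_position == position and predicted != position:
            # We expected to move but didn't: there must be an obstacle there.
            self.obstacles.add(predicted)
        else:
            self.visited.add(observed_position)

# Planning is then just search over predicted outcomes, e.g. trying each
# action and preferring one whose predicted cell has not been visited yet.
model = TinyWorldModel()
model.update((0, 0), "right", (0, 0))      # bumped into a wall on the right
print(model.predict((0, 0), "right"))      # -> (0, 0): the model now 'knows'
```

Even at this toy scale, the two ingredients the article describes are visible: a simulated state of the world, and rules for predicting and correcting it from feedback.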
Building up a human-level world model
While the workings of these examples are simple to understand, it is no trivial matter to scale up from the confines of a chess-board or the floor-plan of a house to a complete model of the world. As usual, the devil is in the details.
Relatively few developers are working in this area, compared with the massive research efforts unleashed on ChatGPT and other generative models.
Nevertheless, behind the scenes, world model-based systems are being developed. They include Google DeepMind's Genie platform and the Cosmos model developed by AI chip specialist Nvidia.
(A good overview of these kinds of system can be found here).
Perhaps the most fully articulated of these solutions is the JEPA model, created by veteran AI researcher Yann LeCun (co-winner of the 2018 Turing Award and, until recently, chief AI scientist at Meta). We’ll look at this example to explore the approaches and challenges involved in developing a human-level world model.
What would a complete world model include?
If prediction and planning are essential to effective action and they depend on a reliable model of the world, what kinds of knowledge would a human-level world model have to include?
A fundamental requirement would be the physical properties of objects and the way they interact. Gravity, mechanical forces and causality in general would be a part of an elementary understanding of the world.
LeCun has pointed out that this kind of basic knowledge is part of the ‘common sense’ possessed by any human child or even an animal such as a cat. And yet such knowledge is beyond even the most advanced generative AI models.
“But why aren’t those systems as smart as a cat? A cat can remember, can understand the physical world, can plan complex actions, can do some level of reasoning—actually much better than the biggest LLMs. That tells you we are missing something conceptually big to get machines to be as intelligent as animals and humans.”
Such knowledge would obviously be essential for any AI model intended to plan and carry out actions in the physical world.
Where does our physical world knowledge come from?
The question of why it is easier to equip computer models with sophisticated linguistic abilities than to teach them elementary physical skills is known as Moravec’s Paradox and was raised by the roboticist Hans Moravec in 1988:
"It is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility".
Moravec’s explanation is that our higher-level cognitive abilities are a comparatively recent development, while our physical skills are the fruit of a much longer evolutionary process. Although they seem instinctive and basic, this apparent simplicity masks an enormously complex skill set.
Encoded in the large, highly evolved sensory and motor portions of the human brain is a billion years of experience about the nature of the world and how to survive in it.
Whatever its origin, it is this physical knowledge that must form part of the capabilities of any AI model if it is to act in the real world.
Training an AI in basic physical skills
So how can we give an AI model the knowledge it needs to understand and navigate the physical world? LeCun’s suggestion is to use the same source that a human child uses: observation of objects in the world and their interactions.
In a Harvard address LeCun points out that, in its first four years of life, a child receives an amount of visual information that exceeds the volume of data used to train the largest modern LLMs.
Video as training data
Given that visual experience is the basis for human physical knowledge, AI practitioners like LeCun have explored the possibility of using the large amounts of video material available online to train AI models to understand and act in the physical world.
This approach is analogous to that used to train LLMs:
A neural network, capable of ‘learning,’ is presented with the first part of a sentence and must supply the next word. The ‘correct’ answer is then revealed and the network adjusts its parameters accordingly. After training on billions of sentences from a huge corpus of text, the model becomes very good at predicting the continuation of sentences.
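As a rough illustration of that 'guess, reveal, adjust' cycle, here is a toy next-word predictor in Python using PyTorch. The vocabulary of half a dozen words and the tiny network stand in for the billions of sentences and parameters of a real LLM; none of the names or sizes come from any production system.

```python
# A minimal sketch of next-word prediction training, assuming PyTorch is
# available. Vocabulary, corpus and model sizes are toy values chosen only
# to illustrate the training loop, not any real recipe.
import torch
import torch.nn as nn

vocab = ["<pad>", "the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

# Tiny "corpus": each training example is (context words, next word).
examples = [(["the", "cat"], "sat"),
            (["cat", "sat"], "on"),
            (["sat", "on"], "the"),
            (["on", "the"], "mat")]

class TinyPredictor(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, context_ids):
        # Average the context embeddings, then score every word in the vocab.
        h = self.embed(context_ids).mean(dim=0)
        return self.out(h)

model = TinyPredictor(len(vocab))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):
    for context, target in examples:
        ids = torch.tensor([word_to_id[w] for w in context])
        target_id = torch.tensor(word_to_id[target])
        logits = model(ids)                      # guess the next word
        loss = loss_fn(logits.unsqueeze(0), target_id.unsqueeze(0))
        optimizer.zero_grad()
        loss.backward()                          # the 'correct' answer is revealed
        optimizer.step()                         # and the parameters are adjusted

# After training, the model should prefer 'sat' after 'the cat'.
ids = torch.tensor([word_to_id["the"], word_to_id["cat"]])
print(vocab[model(ids).argmax().item()])
```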
Adapting this method to video material involves asking the network to predict the next part of a video sequence and adjusting the model accordingly after each attempt.
The representation issue
But applying the method to video material runs into an immediate problem. If we ask a model to predict the next frame in a video sequence, what should we ask it to provide? Requiring it to produce each individual pixel would make no sense, as well as being computationally infeasible. Even asking it to generate individual objects is problematic – do we need it to predict the position of every leaf on a tree, for example?
In fact, before video frames can be used to train physical AI models, they need to be analysed into elements at an appropriate level of detail.
A predictive architecture
For LeCun the most effective level of detail is an abstract, conceptual kind of representation that corresponds to a child’s understanding of the objects in the world and their behavior. To allow a model to develop such an understanding, LeCun’s team have developed a training framework that they call a Joint Embedding Predictive Architecture (JEPA).
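In very rough terms, and leaving aside everything that makes the real architecture work at scale, the joint-embedding idea can be sketched as follows: the model is penalized for mispredicting the embedding of the next frame, not its pixels. The code below is an illustrative toy, not Meta's V-JEPA implementation; the network sizes and frame format are arbitrary.

```python
# A highly simplified sketch of joint-embedding prediction: instead of
# reconstructing the next frame pixel by pixel, the model is trained to
# predict the *embedding* of the next frame. Not Meta's V-JEPA code;
# all shapes and module sizes are arbitrary toy values.
import torch
import torch.nn as nn

FRAME_PIXELS = 3 * 32 * 32     # a tiny 32x32 RGB 'frame', flattened
EMBED_DIM = 64

encoder = nn.Sequential(nn.Linear(FRAME_PIXELS, 128), nn.ReLU(),
                        nn.Linear(128, EMBED_DIM))       # frame -> abstract representation
predictor = nn.Sequential(nn.Linear(EMBED_DIM, EMBED_DIM), nn.ReLU(),
                          nn.Linear(EMBED_DIM, EMBED_DIM))  # representation at t -> t+1

optimizer = torch.optim.Adam(list(encoder.parameters()) +
                             list(predictor.parameters()), lr=1e-3)

def training_step(frame_t, frame_next):
    """One step: predict the next frame's embedding, not its pixels."""
    z_t = encoder(frame_t)
    with torch.no_grad():                 # target embedding held fixed here
        z_next = encoder(frame_next)      # (real systems use extra tricks to
                                          #  stop the representations collapsing)
    z_pred = predictor(z_t)
    loss = ((z_pred - z_next) ** 2).mean()   # error measured in embedding space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Fake 'video': random tensors standing in for consecutive frames.
frame_t, frame_next = torch.rand(FRAME_PIXELS), torch.rand(FRAME_PIXELS)
print(training_step(frame_t, frame_next))
```

The point of the design is visible in the loss function: the model is never asked where every leaf on the tree is, only whether its abstract representation of the next moment was right.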
The latest iteration of this model, V-JEPA 2, has been trained on a million hours of internet video and images. According to the team, the model performs well at understanding video sequences and answering ‘what happens next?’ questions, and can even be used to program robots to work in unfamiliar conditions:
Building on the ability to understand and predict, V-JEPA 2 can be used for zero-shot robot planning to interact with unfamiliar objects in new environments.
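The planning capability described in that quote rests on a simple principle: once a predictor of this kind exists, candidate action sequences can be 'imagined' by rolling them forward in embedding space, and the one predicted to end closest to the goal can be chosen. The sketch below illustrates the idea with random stand-in tensors and invented names; it is not the V-JEPA 2 interface.

```python
# A rough illustration of planning with a learned predictor: roll candidate
# action sequences forward in embedding space and keep the one that lands
# closest to the goal. Invented names and random tensors throughout.
import torch
import torch.nn as nn

EMBED_DIM, ACTION_DIM, N_CANDIDATES, HORIZON = 64, 4, 32, 5

# Stand-in for a trained action-conditioned predictor (here just a random network).
predictor = nn.Linear(EMBED_DIM + ACTION_DIM, EMBED_DIM)

def plan(state_embedding, goal_embedding):
    """Pick the first action of the candidate plan predicted to end nearest the goal."""
    best_actions, best_distance = None, float("inf")
    with torch.no_grad():
        for _ in range(N_CANDIDATES):
            actions = torch.rand(HORIZON, ACTION_DIM)    # a random candidate plan
            z = state_embedding
            for a in actions:                            # imagine its consequences
                z = predictor(torch.cat([z, a]))
            distance = torch.norm(z - goal_embedding).item()
            if distance < best_distance:
                best_actions, best_distance = actions, distance
    return best_actions[0]                               # execute only the first step

state = torch.rand(EMBED_DIM)    # embedding of the current camera view
goal = torch.rand(EMBED_DIM)     # embedding of an image of the desired outcome
print(plan(state, goal))
```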
A lean machine?
According to recent accounts, the JEPA approach is more economical than that of the latest generation of LLMs. The model has 1.8 billion parameters (compared with an estimated 2 to 5 trillion for GPT-5, for example) and the time and hardware requirements for its training are far smaller than those of the latest ‘frontier’ models.
Supporters of the approach attribute this economy to the leanness of the model compared with the ‘brute force’ of the largest GenAI engines. They see it as an alternative to the massive energy and infrastructure requirements driven by current trends in generative AI, which may turn out to be unsustainable.
How long to human-level intelligence, then?
Although he sees the world-model approach as being on the right road towards human-level intelligence (as opposed to LLMs, which he regards as ‘an off-ramp’), LeCun himself puts that achievement at least several years away.
“… reaching human-level AI will take several years if not a decade. [...] But I think the distribution has a long tail: it could take much longer than that.”
A recent review of the JEPA model recognizes its achievements but also notes its limitations:
Despite its strengths, V-JEPA 2 still faces challenges. It struggles with long-term planning—it can predict the next few seconds, but not carry out complex, multi-step tasks.
It seems fair to conclude that the approach, while promising, is still at an early stage and considerably more work will be required to develop a model capable of performing at a human level.
What about a hybrid model?
But the development of world model-based AI engines of the kind explored here does not mean that today’s generative AI models will become obsolete. On the contrary, many AI practitioners are looking forward to ‘neuro-symbolic’ AI solutions that combine the strengths of the two approaches.
These would pair the vast language data of today’s LLMs with the superior reliability and reasoning of the emerging world model-based (symbolic) systems.
In the simplest configuration, the symbolic AI would act as a ‘fact-checker’ for the LLM, verifying the responses created by the language model against its knowledge of the world, using its rigorous deductive capabilities.
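As a rough sketch of that configuration (with an invented knowledge base, a made-up claim format and a placeholder standing in for the language model), the fact-checking step might look like this:

```python
# A rough sketch of the 'fact-checker' configuration: a generative model
# proposes an answer, and a small symbolic knowledge base checks the
# factual claims before the answer is released. The knowledge base, the
# claim format and the stand-in LLM are all invented for illustration.

KNOWLEDGE_BASE = {
    ("water", "boils_at_celsius"): 100,
    ("earth", "moons"): 1,
}

def fake_llm(question):
    """Placeholder for a call to a real language model."""
    return {"answer": "The Earth has two moons.",
            "claims": [("earth", "moons", 2)]}   # structured claims extracted from the answer

def verify(claims):
    """Check each claim against the symbolic knowledge base."""
    for subject, relation, value in claims:
        known = KNOWLEDGE_BASE.get((subject, relation))
        if known is not None and known != value:
            return False, f"Claim ({subject}, {relation}, {value}) contradicts known value {known}."
    return True, "No contradictions found."

response = fake_llm("How many moons does the Earth have?")
ok, report = verify(response["claims"])
print(report if ok else f"Rejected: {report}")
```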
In the next article in this series, we will look in more detail at the current state of neuro-symbolic AI models and their potential applications.
Meanwhile GenAI does useful work in the newsroom
In the creation and delivery of news content, AI assistants are adding value by taking care of routine, time-consuming tasks, freeing authors and editors to concentrate on creating engaging content. The latest AI assistants can 'package' a story in a few seconds to create headlines, summaries, subheadings, SEO elements and social posts.
The arrival of the ProActions framework now allows users to create their own AI tools from multiple models, optimizing their integration into the workspace and maximizing productivity. One recent development generates the entire suite of story elements detailed above with a single click, significantly cutting time-to-market for breaking news stories.
Find out more about the ProActions AI framework.
For print publishers, the layout of print pages is still a major cost item. Eidosmedia customers in Germany are now using an integrated AI engine to cut the time needed to lay out a print edition from several hours to a few minutes.
Find out more about AI print pagination.
