Humans have a remarkable ability to generalize: we learn concepts and their relationships through diverse experiences and then apply them in new contexts. This ability begins to develop at a young age. For instance, a toddler may first learn the color red by being shown a variety of red objects, such as a red ball, a red rose, and a red truck. Having learned this, they can recognize a red tomato even if they have never seen one before. This learning pattern, known as compositionality, underlies a variety of human cognitive processes and has implications not just for developmental neuroscience but also for artificial intelligence (AI) research.
Compositionality involves breaking complex wholes down into reusable parts. For example, the concept of “redness” can be considered an abstract, reusable part that forms a basic component of recognizing and distinguishing red objects. Understanding how humans develop this ability can significantly inform the development of AI systems capable of generalizing knowledge across different contexts, similar to how children develop comprehension and linguistic skills through their early experiences.
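The idea of reusable parts can be made concrete with a small sketch. It builds every color and object pairing, holds one pairing out, and checks that each part of the held-out pairing still appears somewhere in the training pairs. That is the situation of the toddler who has seen red things and has seen tomatoes, but never a red tomato. The function names and the word lists are illustrative assumptions, not taken from the study.

```python
from itertools import product


def split_compositions(colors, objects, held_out):
    """Split all (color, object) pairs into seen and held-out sets."""
    held = set(held_out)
    all_pairs = list(product(colors, objects))
    train = [pair for pair in all_pairs if pair not in held]
    test = [pair for pair in all_pairs if pair in held]
    return train, test


def parts_covered(train, test):
    """Return True if every color and object in test also occurs in train."""
    seen_colors = {color for color, _ in train}
    seen_objects = {obj for _, obj in train}
    return all(c in seen_colors and o in seen_objects for c, o in test)


# Hypothetical vocabulary, chosen to echo the toddler example.
colors = ["red", "blue"]
objects = ["ball", "rose", "truck", "tomato"]
train, test = split_compositions(colors, objects, [("red", "tomato")])
print(len(train), "training pairs; held out:", test)
print("parts seen during training:", parts_covered(train, test))
```

A learner that has truly separated "red" from the objects it appeared with should handle the held-out pair; one that memorized whole pairs would not.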
Early neural networks, the predecessors of today’s large language models (LLMs), were inspired by how the human brain processes information. Their initial aim was to mimic aspects of the brain, such as pattern recognition and language understanding. As these models became more sophisticated, however, they also became increasingly difficult to interpret. Today’s models often have billions or even trillions of tunable parameters, making it almost impossible to fully understand how information is processed within them. New research is now working to make these models more transparent and interpretable.
One of the significant milestones in the study of compositionality in AI came with a new model created by the Cognitive Neurorobotics Research Unit at the Okinawa Institute of Science and Technology (OIST). This model, published in the prestigious journal Science Robotics, is based on a new architecture that provides insights into how compositionality can be achieved in neural networks, opening the black box of machine learning models.
Introducing the New Model
Unlike typical deep learning models, such as the large transformers underpinning modern LLMs, the new model, a Predictive Coding inspired Variational Recurrent Neural Network (PV-RNN), is designed to resemble human cognitive processes more closely. The researchers developed it to improve the understanding of compositionality, particularly in language acquisition, a process of great interest to cognitive scientists studying how language and action develop in children.
Rather than relying solely on large datasets like traditional neural networks, the PV-RNN learns through embodied interaction—integrating different sensory inputs. These inputs include visual data from a robot’s movements (e.g., a video of a robotic arm moving colored blocks), proprioception (the sense of movement of the arm’s joints), and language instructions such as “place red on blue.” By integrating these three distinct types of sensory input—vision, proprioception, and language—the system allows for a deeper, more holistic learning process, resembling how humans experience and interact with the world.
This learning framework is grounded in predictive coding, a theory of how the brain processes sensory information, and in the closely related Free Energy Principle. The principle suggests that the brain continuously predicts sensory inputs based on prior experiences and actions. When there’s a discrepancy between prediction and actual input, that discrepancy, quantified as “free energy,” is minimized by revising the prediction or by acting to bring the input in line with it. In simpler terms, the model is designed to anticipate sensory experiences based on prior exposure, which is how it learns to generalize from limited sensory data.
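The prediction-and-correction loop can be illustrated with a toy calculation. The sketch below is not the PV-RNN and makes several simplifying assumptions: a single numeric belief, a fixed linear mapping from belief to prediction, and a "free energy" reduced to squared prediction error plus a penalty for drifting from the prior. It also revises the belief rather than taking an action. All names and parameter values are hypothetical.

```python
def minimize_prediction_error(observation, prior_belief, weight=2.0,
                              prior_strength=0.1, step_size=0.05, steps=200):
    """Adjust a belief by gradient descent to reduce a toy free energy."""
    belief = prior_belief
    history = []
    for _ in range(steps):
        prediction = weight * belief        # what the belief says we should sense
        error = observation - prediction    # mismatch with what was sensed
        drift = belief - prior_belief       # how far the belief has moved
        # Simplified free energy: prediction error plus a prior penalty.
        free_energy = 0.5 * error ** 2 + 0.5 * prior_strength * drift ** 2
        history.append(free_energy)
        # Gradient of the free energy with respect to the belief.
        gradient = -weight * error + prior_strength * drift
        belief -= step_size * gradient
    return belief, history


belief, history = minimize_prediction_error(observation=4.0, prior_belief=0.0)
print("final belief:", round(belief, 3))
print("free energy fell:", history[-1] < history[0])
```

The belief settles where the pull of the observation balances the pull of the prior, which is the sense in which prediction and reality are reconciled.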
The architecture incorporates attention, working memory, and sequential updating, which lets the model learn gradually and build on experience over time, much as children do. Whereas an LLM processes data all at once, the PV-RNN processes information step by step, refining its understanding with each new piece of input.
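Step-by-step processing can be sketched with a plain recurrent update, in which an internal state is carried forward and revised by each new input. This is a generic recurrent cell with made-up weights, offered only to show what processing one input at a time means. It omits the variational and predictive-coding machinery of the actual PV-RNN.

```python
import math


def step(state, observation, recurrent_weight=0.5, input_weight=1.0):
    """Blend the previous state with one new observation."""
    return math.tanh(recurrent_weight * state + input_weight * observation)


def run_sequence(observations, initial_state=0.0):
    """Process observations one at a time, recording the state after each."""
    state = initial_state
    states = []
    for observation in observations:
        state = step(state, observation)
        states.append(state)
    return states


# The same inputs in a different order leave the model in a different state,
# because each update depends on everything that came before it.
forward = run_sequence([1.0, 0.0, -1.0])
backward = run_sequence([-1.0, 0.0, 1.0])
print("forward states: ", [round(s, 3) for s in forward])
print("backward states:", [round(s, 3) for s in backward])
```

Because every intermediate state is recorded, the evolution of the model's internal representation can be inspected directly after each input.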
Lessons from Toddler Learning
A key takeaway from the research is the observation of how this model learns and generalizes in a manner similar to young children. The model improves its grasp of concepts as it is exposed to the same word used in different contexts. This process reflects how toddlers interact with and learn concepts like color. When a toddler encounters several red objects—a truck, a ball, and a rose—they begin to understand that the underlying quality is redness, even when presented in different contexts. This learning approach contrasts with how traditional AI models often require large datasets with little to no variation in context to understand a single concept.
As Dr. Prasanna Vijayaraghavan, the first author of the study, notes: “The more exposure the model has to the same word in different contexts, the better it learns that word.” This finding further mirrors how toddlers will learn the word “red” much faster through interacting with a variety of red objects rather than repeatedly pushing a red truck.
Compositionality lies at the heart of this work. It allows us to understand complex ideas using simple building blocks (like combining the color red with the concept of an object), and the PV-RNN sheds light on how those building blocks are formed and combined when language, vision, and action are learned together. By simulating this type of learning in a robot-like model, researchers can better understand both AI development and human development.
Making AI Models More Transparent
An important challenge with large, complex neural networks today is their opacity—it is difficult, if not impossible, to observe how these networks process information internally. LLMs, for instance, are capable of producing impressive results but are largely black boxes. The researchers behind the PV-RNN believe that transparency in machine learning is vital for future AI systems, particularly for understanding decision-making processes, minimizing mistakes, and improving safety and reliability.
Dr. Vijayaraghavan observes, “Our model requires a significantly smaller training set and much less computing power to achieve compositionality. It does make more mistakes than LLMs do, but it makes mistakes that are similar to how humans make mistakes.” In other words, the model’s errors tend to be contextually relevant, in the same way that a person might misapply an abstract concept like “red” in a new situation and then learn from that error.
What stands out in the study is that, because of its relatively shallow architecture, the PV-RNN lets researchers peek into the model’s internal state: its evolving representation of information, essentially its memory of how it perceives the world over time. With this insight, researchers can trace how information flows and how it is processed, giving them a more nuanced view of the model’s decision-making.
This increased transparency is crucial for the next step in AI development: understanding how models behave, making their behaviors more predictable, and ensuring they make decisions that align with human intentions, both in normal and new situations.
Embodied Learning and the Poverty of Stimulus
The PV-RNN model has also helped scientists examine questions in cognitive theory, such as the Poverty of Stimulus problem. This concept suggests that children acquire language far more rapidly than would be expected from the linguistic data they receive alone. After all, young children aren’t typically exposed to the full range of vocabulary and sentence structures needed to develop a complete linguistic understanding. Nevertheless, children somehow manage to deduce complex grammatical rules through everyday interactions.
This issue has long puzzled linguists, but the results from the PV-RNN offer new insights into this phenomenon. Despite having limited input data, just like toddlers do, the model learns compositionality effectively by grounding language not only in observation but also in action and interaction with the world. It’s the embodied nature of learning that seems to help the AI “fill in the gaps,” learning language through action and context rather than relying solely on theoretical knowledge.
Implications for Safer and Ethical AI
Another key impact of the PV-RNN model is its potential to lead to safer, more ethical AI. Traditional LLMs learn linguistic knowledge in an abstract, decontextualized manner, without the emotional or embodied experiences that would inform the deeper meanings of concepts. For example, the word “suffering” in a purely linguistic model may hold little weight beyond its definition in a dictionary.
But for the PV-RNN, which learns from embodied interaction, concepts like suffering could be learned through direct interaction with sensory experiences and human action. This connection between meaning and experience could allow future AI systems to better understand the consequences of their actions and produce more empathetic, context-sensitive responses.
The notion of combining learning with real-world experiences mirrors how we teach children not just facts but emotional and ethical principles through real-life exposure. As AI systems become more integrated into human society, incorporating emotional and ethical considerations through embodied learning could prove essential for their safe deployment.
Future Directions
As the research continues, there are bound to be new discoveries about how AI systems and human-like cognitive processes intertwine. The development of models such as the PV-RNN opens the door for new types of research, exploring not just artificial intelligence but human cognitive growth. Understanding how language and action are combined could not only make AI systems more intelligent but also more reliable and ethical.
“We are continuing our work to enhance the capabilities of this model and are using it to explore various domains of developmental neuroscience,” says Dr. Jun Tani, the senior author of the study. “We are excited to see what future insights into cognitive development and language learning processes we can uncover.”
Conclusion
The development of the PV-RNN model by researchers at the Okinawa Institute of Science and Technology marks a significant step forward in understanding how AI can learn in a more human-like manner. By learning through embodied interaction that combines language, vision, and proprioception, supported by attention and working memory, the model offers valuable insights into compositionality, mirroring the way children generalize knowledge through varied experiences. Unlike current large language models, which process vast amounts of data in opaque ways, the PV-RNN allows for greater transparency, enabling researchers to visualize and understand the network’s internal processes. The model also sheds light on cognitive development and language acquisition while paving the way for future AI systems that are safer, more ethical, and more transparent. Ultimately, it opens new avenues in both AI research and developmental neuroscience, offering a glimpse into how learning, memory, and action can be intertwined to create more intelligent and socially responsible machines.
Reference: Prasanna Vijayaraghavan et al., “Development of compositionality through interactive learning of language and action of robots,” Science Robotics (2025). DOI: 10.1126/scirobotics.adp0751