An artificial intelligence has been able to match images and words after being trained with recordings of what a baby sees and hears on a daily basis. The system, presented Thursday in the journal Science, shows that the ingredients needed for an AI to start learning language are minimal, and opens the door to developing more efficient models that learn more like humans do.
Other AIs had been able to match objects to words before, but none had done so with so little training data, or by learning like a child.
The recordings used to train the model come from a camera and a voice recorder that Sam, an Australian boy, wore mounted on a helmet for two-hour sessions each week from the age of 6 to 25 months. The boy lived with his family and two cats in a semi-rural environment 30 kilometers from Adelaide.
The researchers have similar recordings of two other children – Alice, from California, and Asa, from New York – although they have not used them to train this first AI that simulates how the human brain learns language.
The recordings are part of a database developed by the Massachusetts Institute of Technology (MIT) in 2021 to analyze children’s learning.
The advance is fundamental to understanding why “humans are much more efficient with data when learning language, compared to the best current AI systems”, explains Wai Keen Vong, a researcher at New York University (NYU) and co-author of the article, in an email to La Vanguardia.
“There is a big data gap between the way in which AI systems and children learn language”, corroborates Brenden Lake, also a researcher at NYU, who led the research. “It is fundamental for researchers to bridge this gap if we want to build machines that can learn and think like people, and if we are to expand the training of language models beyond the confines of the big technology companies”, he concludes.
More human-like learning implies more efficient learning in terms of the resources needed for training. In short, the model is cheaper, a key goal of academic research, which aims to ensure that scientists working in public institutions can access these tools despite having far smaller budgets than large corporations.
To bridge the data gap Lake talks about, the machine just needs to understand that when a word and an image appear at the same time, it means they are related, while if they appear separated in time, it means they are not. This information is enough for the neural networks to start learning language “when combined with the type of input that a child receives”, details Vong.
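That co-occurrence principle is what machine learning researchers call contrastive learning. The following is a minimal sketch of the idea in Python with PyTorch; the function, sizes and embeddings are illustrative assumptions, not the study’s actual code:

```python
# Sketch of the contrastive idea described above: frames and words seen
# together are pulled closer in a shared embedding space, while pairs
# from different moments are pushed apart. All names and dimensions here
# are illustrative, not taken from the paper's implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, word_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of co-occurring (image, word) pairs.

    image_emb, word_emb: (batch, dim) tensors; row i of each tensor comes
    from the same moment in the recordings, so it forms a positive pair.
    Every other combination in the batch acts as a negative pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    # Similarity of every image to every word in the batch.
    logits = image_emb @ word_emb.t() / temperature
    # The matching (co-occurring) pairs lie on the diagonal.
    targets = torch.arange(logits.size(0))
    # Symmetric loss: each image must find its word, and vice versa.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random embeddings standing in for encoder outputs.
imgs, words = torch.randn(8, 128), torch.randn(8, 128)
print(contrastive_loss(imgs, words).item())
```

Minimizing a loss of this kind is what lets the network discover, without any labels, that the word heard at a given moment tends to name what the camera is seeing.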
Although the new approach is somewhat closer to human learning, both in how the data is acquired and in how it is processed, it is still far from describing exactly how children acquire their first words. The aim of the study was for the system to learn with the minimum possible built-in capacities, the authors acknowledge; the extent to which other cognitive skills contribute to language development will need to be analyzed in the future.
“Both the model and the data are still quite limited compared to the experiences and abilities of children”, who by the time they are two years old already master about 300 words, explains Lake. Unlike children, AI is unable to relate to the environment: it has no senses, it learns passively, it has no social skills and it has no goals or needs.
In addition, the AI learns from transcribed words, not spoken language, which, on the one hand, makes its task easier – reading words is simpler than hearing them, both technically and logistically – and, on the other hand, hinders it, since details such as intonation are lost. The research team is currently working to improve speech recognition technology to address the latter limitation, hoping that it will “provide more details about language acquisition,” concludes Vong.
The year and a half of weekly recordings of the baby translated into 61 hours of data – about 600,000 video frames and 37,500 transcribed utterances – with which the NYU scientists trained the artificial intelligence, which watched them over and over again. This repetition of the same information also differs from the richness of the child’s day-to-day life: a child learns in long and varied episodes, rather than short, repeated fragments.
To evaluate the performance of the system, the team designed a test in which it gave the AI a word and the model had to recognize the object it referred to among four options, all of them known and studied during the training phase. The model was correct in six out of ten cases, an accuracy only five percentage points lower than that of another model, CLIP, which was trained with 400 million image-word pairs from the Internet.
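Scoring such a four-alternative test is straightforward: for each word, pick the candidate image whose embedding is most similar, and count how often that pick is correct. Below is a hedged sketch under that assumption; the function and variable names are illustrative, not from the paper’s codebase:

```python
# Illustrative scorer for a four-way forced-choice test: for each word,
# the model must pick the matching image out of four candidates.
import torch
import torch.nn.functional as F

def four_way_accuracy(word_embs, candidate_embs, target_idx):
    """word_embs: (n_trials, dim); candidate_embs: (n_trials, 4, dim);
    target_idx: (n_trials,) index of the correct image per trial."""
    w = F.normalize(word_embs, dim=-1).unsqueeze(1)   # (n, 1, d)
    c = F.normalize(candidate_embs, dim=-1)           # (n, 4, d)
    sims = (w * c).sum(-1)                            # (n, 4) cosine similarities
    picks = sims.argmax(dim=-1)                       # index of the chosen image
    return (picks == target_idx).float().mean().item()

# Toy run: random embeddings should score near the 25% chance level.
n, d = 1000, 64
acc = four_way_accuracy(torch.randn(n, d), torch.randn(n, 4, d),
                        torch.randint(0, 4, (n,)))
print(f"accuracy: {acc:.2f}")
```

With random embeddings the score hovers around the 25% chance level, which is what makes the model’s six-out-of-ten result meaningful.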
In a second test, the researchers wanted to evaluate whether the model is able to generalize, that is, to recognize the objects it had been trained on, but shown on a white background and with slightly different shapes. In this case, the system was right for 37% of the 67 words it was asked about, above the random baseline of 25%, which shows that, despite the difficulty, the AI does have a certain capacity for abstraction.
The team hopes to improve the results by expanding the amount of data used to train the model and by adding cognitive and sensory skills to the AI. To that end, they have additional footage of the boy whose recordings trained the model, as well as recordings of the other two children who participated in the same research project.