To help AIs understand the world, researchers put them in a robot
Reality, what a concept

There's a difference between knowing a word and knowing a concept.

Jacek Krywko, Feb 1, 2025 7:05 am

Credit: Thomas Vogel

Large language models like ChatGPT display conversational skills, but the problem is they don't really understand the words they use. They are primarily systems that interact with data obtained from the real world, but not the real world itself. Humans, on the other hand, associate language with experiences. We know what the word "hot" means because we've been burned at some point in our lives.

Is it possible to get an AI to achieve a human-like understanding of language? A team of researchers at the Okinawa Institute of Science and Technology built a brain-inspired AI model comprising multiple neural networks. The AI was very limited: it could learn a total of just five nouns and eight verbs. But their AI seems to have learned more than just those words; it learned the concepts behind them.

Babysitting robotic arms

"The inspiration for our model came from developmental psychology. We tried to emulate how infants learn and develop language," says Prasanna Vijayaraghavan, a researcher at the Okinawa Institute of Science and Technology and the lead author of the study.

The idea of teaching AIs the same way we teach little babies is not new; it has previously been applied to standard neural nets that associated words with visuals. Researchers have also tried teaching an AI using a video feed from a GoPro strapped to a human baby. The problem is that babies do way more than just associate items with words when they learn. They touch everything: they grasp things, manipulate them, throw stuff around, and this way they learn to think and plan their actions in language. An abstract AI model couldn't do any of that, so Vijayaraghavan's team gave one an embodied experience: their AI was trained in an actual robot that could interact with the world.

Vijayaraghavan's robot was a fairly simple system with an arm and a gripper that could pick objects up and move them around. Vision was provided by a simple RGB camera feeding video at a somewhat crude 64×64 pixel resolution.

The robot and the camera were placed in a workspace, set in front of a white table with blocks painted green, yellow, red, purple, and blue. The robot's task was to manipulate those blocks in response to simple prompts like "move red left," "move blue right," or "put red on blue." All that didn't seem particularly challenging. What was challenging, though, was building an AI that could process all those words and movements in a manner similar to humans. "I don't want to say we tried to make the system biologically plausible," Vijayaraghavan told Ars. "Let's say we tried to draw inspiration from the human brain."

Chasing free energy

The starting point for Vijayaraghavan's team was the free energy principle, a hypothesis that the brain constantly makes predictions about the world based on internal models, then updates these predictions based on sensory input. The idea is that we first think of an action plan to achieve a desired goal, and then this plan is updated in real time based on what we experience during execution. This goal-directed planning scheme, if the hypothesis is correct, governs everything we do, from picking up a cup of coffee to landing a dream job.
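In very rough terms, the predict-and-update loop the free energy principle describes can be sketched in a few lines of Python. The snippet below is a toy illustration with made-up numbers and function names, not code from the Okinawa model: an agent keeps refining both its internal model of the world and its action plan until what it observes matches what it wants.

    # Toy sketch of a free-energy-style loop: predict, observe, update.
    # Illustrative only; this is not the architecture described in the paper.
    def plan_with_internal_model(goal, world, steps=200, lr=0.1):
        plan = 0.0   # current action plan (here just a single number)
        gain = 0.5   # internal model: how strongly the agent thinks its actions affect the world
        for _ in range(steps):
            predicted = gain * plan                # prediction from the internal model
            observed = world(plan)                 # sensory feedback from actually acting
            prediction_error = observed - predicted
            gain += lr * prediction_error * plan   # update the model to better match reality
            goal_error = goal - observed
            plan += lr * gain * goal_error         # update the plan to move toward the goal
        return plan, observed

    # Hypothetical world in which actions are damped by a factor of 0.9.
    plan, outcome = plan_with_internal_model(goal=1.0, world=lambda a: 0.9 * a)
    print(round(plan, 2), round(outcome, 2))  # the plan overshoots to ~1.11 so the outcome lands near 1.0

The real system replaces those two numbers with neural networks and a robot arm, but the rhythm is the same: predict, act, compare, adjust.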
All that is closely intertwined with language. Neuroscientists at the University of Parma found that motor areas in the brain were activated when the participants in their study listened to action-related sentences. To emulate that in a robot, Vijayaraghavan used four neural networks working in a closely interconnected system. The first was responsible for processing visual data coming from the camera. It was tightly integrated with a second neural net that handled proprioception: all the processes that ensured the robot was aware of its position and the movement of its body. This second neural net also built internal models of the actions necessary to manipulate blocks on the table. Those two neural nets were additionally hooked up to visual memory and attention modules that enabled them to reliably focus on the chosen object and separate it from the image's background.

The third neural net was relatively simple and processed language using vectorized representations of those "move red right" sentences. Finally, the fourth neural net worked as an associative layer and predicted the output of the previous three at every time step. "When we do an action, we don't always have to verbalize it, but we have this verbalization in our minds at some point," Vijayaraghavan says. The AI he and his team built was meant to do just that: seamlessly connect language, proprioception, action planning, and vision.

When the robotic brain was up and running, the team started teaching it some of the possible combinations of commands and sequences of movements. But they didn't teach it all of them.

The birth of compositionality

In 2016, Brenden Lake, a professor of psychology and data science, published a paper in which his team named a set of competencies machines need to master to truly learn and think like humans. One of them was compositionality: the ability to compose or decompose a whole into parts that can be reused. This reuse lets them generalize acquired knowledge to new tasks and situations. "The compositionality phase is when children learn to combine words to explain things. They [initially] learn the names of objects, the names of actions, but those are just single words. When they learn this compositionality concept, their ability to communicate kind of explodes," Vijayaraghavan explains.

The AI his team built was made for this exact purpose: to see if it would develop compositionality. And it did.

Once the robot learned how certain commands and actions were connected, it also learned to generalize that knowledge to execute commands it had never heard before, recognizing the names of actions it had not performed and then performing them on combinations of blocks it had never seen. Vijayaraghavan's AI figured out the concept of moving something to the right or the left, or putting an item on top of something. It could also combine words to name previously unseen actions, like putting a blue block on a red one.

While teaching robots to extract concepts from language has been done before, those efforts were focused on making them understand how words were used to describe visuals. Vijayaraghavan built on that to include proprioception and action planning, basically adding a layer that integrated sense and movement into the way his robot made sense of the world.
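That generalization test is easiest to picture as a train/test split over command combinations. The snippet below is a hypothetical reconstruction for illustration, not the paper's actual protocol: the colors match the blocks on the table, but the command templates and the exact split are assumptions, loosely mirroring the roughly 80/20 ratio discussed below.

    # Hypothetical sketch of a compositional hold-out: train on most color/action
    # pairings, then test on pairings the model has never seen together.
    # The command templates here are illustrative, not the study's exact vocabulary.
    import itertools
    import random

    colors = ["red", "green", "yellow", "blue", "purple"]
    templates = ["move {} left", "move {} right", "put {} on {}"]

    commands = []
    for t in templates:
        if t.count("{}") == 1:
            commands += [t.format(c) for c in colors]
        else:
            commands += [t.format(a, b) for a, b in itertools.permutations(colors, 2)]

    random.seed(0)
    random.shuffle(commands)
    cut = int(0.8 * len(commands))            # keep roughly 80 percent for training
    train, held_out = commands[:cut], commands[cut:]

    print(f"{len(train)} training commands, {len(held_out)} held out")
    print("commands the model must compose on its own:", held_out[:3])

A model that has merely memorized the training pairs fails on the held-out ones; a model that has learned what "move," "put on," and each color mean can handle them anyway, and that is what compositionality looks like in practice.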
But some issues have yet to be overcome. The AI had a very limited workspace, with only a few objects, all of them a single cubical shape. The vocabulary included only the names of colors and actions, so no modifiers, adjectives, or adverbs. Finally, the robot had to learn around 80 percent of all possible combinations of nouns and verbs before it could generalize well to the remaining 20 percent. Its performance was worse when those ratios dropped to 60/40 and 40/60.

But it's possible that just a bit more computing power could fix this. "What we had for this study was a single RTX 3090 GPU, so with the latest-generation GPU, we could solve a lot of those issues," Vijayaraghavan argued. That's because the team hopes that adding more words and more actions won't result in a dramatic need for computing power. "We want to scale the system up. We have a humanoid robot with cameras in its head and two hands that can do way more than a single robotic arm. So that's the next step: using it in the real world with real-world robots," Vijayaraghavan said.

Science Robotics, 2025. DOI: 10.1126/scirobotics.adp0751

Jacek Krywko, Associate Writer. Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.