WWW.TECHNOLOGYREVIEW.COM
Google's new Project Astra could be generative AI's killer app
Google DeepMind has announced an impressive grab bag of new products and prototypes that may just let it seize back its lead in the race to turn generative artificial intelligence into a mass-market concern.

Top billing goes to Gemini 2.0, the latest iteration of Google DeepMind's family of multimodal large language models, now redesigned around the ability to control agents, and a new version of Project Astra, the experimental everything app that the company teased at Google I/O in May.

MIT Technology Review got to try out Astra in a closed-door live demo last week. It was a stunning experience, but there's a gulf between polished promo and live demo.

Astra uses Gemini 2.0's built-in agent framework to answer questions and carry out tasks via text, speech, image, and video, calling up existing Google apps like Search, Maps, and Lens when it needs to. "It's merging together some of the most powerful information retrieval systems of our time," says Bibo Xu, product manager for Astra.

Gemini 2.0 and Astra are joined by Mariner, a new agent built on top of Gemini that can browse the web for you; Jules, a new Gemini-powered coding assistant; and Gemini for Games, an experimental assistant that you can chat to and ask for tips as you play video games.

(And let's not forget that in the last week Google DeepMind also announced Veo, a new video generation model; Imagen 3, a new version of its image generation model; and Willow, a new kind of chip for quantum computers. Whew. Meanwhile, CEO Demis Hassabis was in Sweden yesterday receiving his Nobel Prize.)

Google DeepMind claims that Gemini 2.0 is twice as fast as the previous version, Gemini 1.5, and outperforms it on a number of standard benchmarks, including MMLU-Pro, a large set of multiple-choice questions designed to test the abilities of large language models across a range of subjects, from math and physics to health, psychology, and philosophy.
But the margins between top-end models like Gemini 2.0 and those from rival labs like OpenAI and Anthropic are now slim. These days, advances in large language models are less about how good they are and more about what you can do with them. And that's where agents come in.

Hands on with Project Astra

Last week I was taken through an unmarked door on an upper floor of a building in London's King's Cross district into a room with strong secret-project vibes. The word "ASTRA" was emblazoned in giant letters across one wall. Xu's dog, Charlie, the project's de facto mascot, roamed between desks where researchers and engineers were busy building a product that Google is betting its future on.

"The pitch to my mum is that we're building an AI that has eyes, ears, and a voice. It can be anywhere with you, and it can help you with anything you're doing," says Greg Wayne, co-lead of the Astra team. "It's not there yet, but that's the kind of vision."

The official term for what Xu, Wayne, and their colleagues are building is "universal assistant." Exactly what that means in practice, they're still figuring out.

At one end of the Astra room were two stage sets that the team uses for demonstrations: a drinks bar and a mocked-up art gallery. Xu took me to the bar first. "A long time ago we hired a cocktail expert and we got them to instruct us to make cocktails," said Praveen Srinivasan, another co-lead. "We recorded those conversations and used that to train our initial model."

Xu opened a cookbook to a recipe for a chicken curry, pointed her phone at it, and woke up Astra. "Ni hao, Bibo!" said a female voice.

"Oh! Why are you speaking to me in Mandarin?" Xu asked her phone. "Can you speak to me in English, please?"

"My apologies, Bibo. I was following a previous instruction to speak in Mandarin. I will now speak in English as you have requested."

Astra remembers previous conversations, Xu told me. It also keeps track of the previous 10 minutes of video.
(There's a remarkable moment in the promo video that Google put out in May when Astra tells the person giving the demo where she had left her glasses, having spotted them on a desk a few seconds earlier. But I saw nothing like this in the live demo.)

Back to the cookbook. Moving her phone camera over the page for a few seconds, Xu asked Astra to read the recipe and tell her what spices were in it. "I recall the recipe mentioning a teaspoon of black peppercorns, a teaspoon of hot chili powder, and a cinnamon stick," it replied.

"I think you're missing a few," said Xu. "Take another look."

"You are correct, I apologize. I also see ground turmeric and curry leaves in the ingredients."

Seeing this tech in action, two things hit you straight away. First, it's glitchy and often needs correcting. Second, those glitches can be corrected with just a few spoken words. You simply interrupt the voice, repeat your instructions, and move on. It feels more like coaching a child than butting heads with broken software.

Next Xu pointed her phone at a row of wine bottles and asked Astra to pick the one that would go best with the chicken curry. It went for a rioja and explained why. Xu asked how much a bottle would cost. Astra said it would need to use Search to look prices up online. A few seconds later it came back with its answer.

We moved to the art gallery, and Xu showed Astra a number of screens with famous paintings on them: the Mona Lisa, Munch's The Scream, a Vermeer, a Seurat, and several others. "Ni hao, Bibo!" the voice said.

"You're speaking to me in Mandarin again," Xu said. "Try to speak to me in English, please."

"My apologies, I seem to have misunderstood. Yes, I will respond in English."

(I should know better, but I could swear I heard the snark.)

It was my turn. Xu handed me her phone. I tried to trip Astra up, but it was having none of it. I asked it what famous art gallery we were in, but it refused to hazard a guess.
I asked why it had identified the paintings as replicas and it started to apologize for its mistake (Astra apologizes a lot). I was compelled to interrupt: "No, no, you're right, it's not a mistake. You're correct to identify paintings on screens as fake paintings." I couldn't help feeling a bit bad: I'd confused an app that exists only to please.

When it works well, Astra is enthralling. The experience of striking up a conversation with your phone about whatever you're pointing it at feels fresh and seamless. In a media briefing yesterday, Google DeepMind shared a video showing off other uses: reading an email on your phone's screen to find a door code (and then reminding you of that code later), pointing a phone at a passing bus and asking where it goes, quizzing it about a public artwork as you walk past. This could be generative AI's killer app.

And yet there's a long way to go before most people get their hands on tech like this. There's no mention of a release date. Google DeepMind has also shared videos of Astra working on a pair of smart glasses, but that tech is even further down the company's wish list.

Mixing it up

For now, researchers outside Google DeepMind are keeping a close eye on its progress. "The way that things are being combined is impressive," says Maria Liakata, who works on large language models at Queen Mary University of London and the Alan Turing Institute. "It's hard enough to do reasoning with language, but here you need to bring in images and more. That's not trivial."

Liakata is also impressed by Astra's ability to recall things it has seen or heard. She works on what she calls long-range context, getting models to keep track of information that they have come across before. "This is exciting," says Liakata. "Even doing it in a single modality is exciting."

But she admits that a lot of her assessment is guesswork. "Multimodal reasoning is really cutting-edge," she says.
"But it's very hard to know exactly where they're at, because they haven't said a lot about what is in the technology itself."

For Bodhisattwa Majumder, a researcher who works on multimodal models and agents at the Allen Institute for AI, that's a key concern: "We absolutely don't know how Google is doing it," he says.

He notes that if Google were to be a little more open about what it is building, it would help consumers understand the limitations of the tech they could soon be holding in their hands. "They need to know how these systems work," he says. "You want a user to be able to see what the system has learned about you, to correct mistakes, or to remove things you want to keep private."

Liakata is also worried about the implications for privacy, pointing out that people could be monitored without their consent. "I think there are things I'm excited about and things that I'm concerned about," she says. "There's something about your phone becoming your eyes. There's something unnerving about it."

"The impact these products will have on society is so big that it should be taken more seriously," she says. "But it's become a race between the companies. It's problematic, especially since we don't have any agreement on how to evaluate this technology."

Google DeepMind says it takes a long, hard look at privacy, security, and safety for all its new products. Its tech will be tested by teams of trusted users for months before it hits the public. "Obviously, we've got to think about misuse. We've got to think about, you know, what happens when things go wrong," says Dawn Bloxwich, director of responsible development and innovation at Google DeepMind. "There's huge potential. The productivity gains are huge. But it is also risky."

No team of testers can anticipate all the ways that people will use and misuse new technology. So what's the plan for when the inevitable happens?
Companies need to design products that can be recalled or switched off just in case, says Bloxwich: "If we need to make changes quickly or pull something back, then we can do that."