
Google's AI Co-scientist is 'test-time scaling' on steroids. What that means for research
www.zdnet.com
Google on Wednesday said it has tweaked its Gemini 2.0 large language model to generate novel scientific hypotheses in a fraction of the time taken by teams of human lab researchers.

The company bills the "AI Co-scientist" version of Gemini as "a promising advance toward AI-assisted technologies for scientists to help accelerate discovery," a program meant to be run with a human "in the loop" that will "act as a helpful assistant and collaborator to scientists and to help accelerate the scientific discovery process."

It's also a demonstration of how so-called reasoning AI models are driving the use of computing resources higher and higher -- cross-referencing, evaluating, ranking, sorting, and sifting -- all after the user has typed the prompt.

[Image: Google's AI Co-scientist is meant to have a "human in the loop," directing the machine's various operations, such as literature review and hypothesis formation. Google]

In an audacious mash-up of scientific publishing and marketing, Google's researchers published a technical paper describing a hypothesis generated by Co-scientist simultaneously with a paper by a group of human scientists at Imperial College London arriving at the same hypothesis. The Co-scientist hypothesis, concerning a specific way in which bacteria evolve to form new pathogens, took two days to produce, whereas the human-produced work was the result of a decade of study and lab work, Google claims.

Hypothesis-formulation machine

Google describes the program as a hypothesis-formulation machine built from multiple agents:

Given a scientist's research goal that has been specified in natural language, the AI Co-scientist is designed to generate novel research hypotheses, a detailed research overview, and experimental protocols. To do so, it uses a coalition of specialized agents: Generation, Reflection, Ranking, Evolution, Proximity, and Meta-review.

[Image: Google's design for AI Co-scientist has a person input a research goal at the prompt, whereupon a series of agents work in parallel to review the literature and to formulate and evaluate hypotheses. Google]

[Image: The structure of AI Co-scientist is designed to perform the multiple agent tasks in parallel, backed by a memory-management function for storing intermediate results. Google]

The Co-scientist starts to work after the scientist types their research goal at the prompt, "along with preferences, experiment constraints, and other attributes." Google insists the program goes beyond mere literature review to instead "uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and tailored to specific research objectives."

Test-time scaling on steroids

The modification of Gemini 2.0 emphasizes "test-time scaling," in which AI agents use increasing amounts of computing power to iteratively review and reformulate their output. Test-time scaling has been seen most dramatically not only in Gemini but also in OpenAI's o1 model and DeepSeek AI's R1 -- all examples of so-called reasoning models that spend much more time responding to a prompt, generating intermediate results along the way. The AI Co-scientist is test-time scaling on steroids.
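To make that loop concrete, here is a minimal sketch of a test-time scaling cycle in the spirit of the agent coalition Google describes. Every function and score here is a hypothetical stand-in, not Google's implementation; the point is simply that a larger compute budget buys more rounds of critique, ranking, and refinement.

```python
# Minimal sketch of a test-time scaling loop over hypothesis agents.
# All agents below are toy stand-ins, not Google's actual system.
import random

def generate(goal):
    """Generation agent: draft a candidate hypothesis for the goal."""
    return {"text": f"hypothesis for {goal} #{random.randint(0, 999)}",
            "score": random.random()}

def reflect(hypothesis):
    """Reflection agent: critique and lightly revise a hypothesis."""
    hypothesis["score"] = min(1.0, hypothesis["score"] + 0.05)
    return hypothesis

def evolve(hypothesis):
    """Evolution agent: mutate a promising hypothesis into a variant."""
    variant = dict(hypothesis)
    variant["score"] = min(1.0, variant["score"] + random.uniform(-0.1, 0.2))
    return variant

def co_scientist(goal, compute_budget):
    """Spend the test-time compute budget refining a pool of hypotheses."""
    pool = [generate(goal) for _ in range(4)]        # initial proposals
    for _ in range(compute_budget):                  # more budget -> more rounds
        pool = [reflect(h) for h in pool]            # critique every candidate
        pool.sort(key=lambda h: h["score"], reverse=True)  # Ranking agent
        pool = pool[:4] + [evolve(h) for h in pool[:2]]    # refine the best
    return max(pool, key=lambda h: h["score"])       # top-ranked hypothesis

print(co_scientist("bacterial gene transfer", compute_budget=10))
```

Raising `compute_budget` is the sketch's analogue of spending more at inference time: the same model, run through more rounds of review and re-formulation.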
In the formal paper, led by Juraj Gottweis of Google and posted on the arXiv pre-print server, the authors describe their work as an enhancement of what DeepSeek's R1 model has pioneered:

"Recent advancements, like the DeepSeek-R1 model, further demonstrate the potential of test-time compute by leveraging reinforcement learning to refine the model's 'chain-of-thought' and enhance complex reasoning abilities over longer horizons. In this work, we propose a significant scaling of the test-time compute paradigm using inductive biases derived from the scientific method to design a multi-agent framework for scientific reasoning and hypothesis generation without any additional learning techniques."

The Co-scientist is built from a selection of AI agents that can access external resources, write Gottweis and team. "They are also equipped to interact with external tools, such as web search engines and specialized AI models, through application programming interfaces," they write.

Where test-time scaling comes most into play is the notion of a "tournament," in which the Co-scientist compares and ranks the multiple hypotheses it has generated. It does so using "Elo" scores, a measurement system commonly used to rank chess players and athletes. As Gottweis and team describe it, one of the agents, the Ranking agent, has the main responsibility of rating the competing hypotheses in a competitive fashion:

"An important abstraction in the Co-scientist system is the notion of a tournament where different research proposals are evaluated and ranked, enabling iterative improvements. The Ranking agent employs and orchestrates an Elo-based tournament to assess and prioritize the generated hypotheses at any given time. This involves pairwise comparisons, facilitated by simulated scientific debates, which allow for a nuanced evaluation of the relative merits of each proposal."

The ranking is meant to make the better hypotheses bubble up to the top. "This ranking serves to communicate to scientists an ordered list of research hypotheses and proposals aligned with the research goal," as they put it. Google claims the data show that more and more compute, and repeated ranking and re-ranking, make the hypotheses increasingly better as rated by human observers.
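As a rough illustration of the Elo mechanics the paper invokes, here is a minimal sketch of such a tournament. The `judge` function, standing in for the simulated scientific debates, is hypothetical; the update rule is the standard Elo formula used in chess.

```python
# Minimal sketch of an Elo-based tournament over research hypotheses.
import itertools
import random

K = 32           # Elo sensitivity constant, as commonly used in chess
BASELINE = 1200  # starting rating for every new hypothesis

def judge(a, b):
    """Stand-in for a simulated debate: 1 if a wins, 0 if b wins.
    Here it flips a coin; the real system would compare merits."""
    return 1 if random.random() < 0.5 else 0

def elo_update(rating_a, rating_b, outcome):
    """Standard Elo update after one pairwise comparison."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    rating_a += K * (outcome - expected_a)
    rating_b += K * ((1 - outcome) - (1 - expected_a))
    return rating_a, rating_b

def tournament(hypotheses):
    """Rank hypotheses by running every pairwise comparison once."""
    ratings = {h: BASELINE for h in hypotheses}
    for a, b in itertools.combinations(hypotheses, 2):
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b],
                                            judge(a, b))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

for hypothesis, rating in tournament(["H1", "H2", "H3", "H4"]):
    print(f"{hypothesis}: {rating:.0f}")
```

The appeal of Elo here is that it turns many noisy, local pairwise judgments into a single global ordering, which is exactly what a scientist scanning a ranked list of proposals needs.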
"The trends with distillation and inference time compute costs indicate that such intelligent and general AI systems are rapidly becoming more affordable and available," they note. Artificial Intelligence