Why AI Safety Researchers Are Worried About DeepSeek
time.com
By Billy Perrigo
January 29, 2025 12:07 PM EST

The release of DeepSeek R1 stunned Wall Street and Silicon Valley this month, spooking investors and impressing tech leaders. But amid all the talk, many overlooked a critical detail about the way the new Chinese AI model functions: a nuance that has researchers worried about humanity's ability to control sophisticated new artificial intelligence systems.

It's all down to an innovation in how DeepSeek R1 was trained, one that led to surprising behaviors in an early version of the model, which researchers described in the technical documentation accompanying its release.

During testing, researchers noticed that the model would spontaneously switch between English and Chinese while it was solving problems. When they forced it to stick to one language, thus making it easier for users to follow along, they found that the system's ability to solve the same problems would diminish.

That finding rang alarm bells for some AI safety researchers. Currently, the most capable AI systems "think" in human-legible languages, writing out their reasoning before coming to a conclusion. That has been a boon for safety teams, whose most effective guardrails involve monitoring models' so-called chains of thought for signs of dangerous behaviors. But DeepSeek's results raised the possibility of a decoupling on the horizon: one where new AI capabilities could be gained by freeing models from the constraints of human language altogether.

To be sure, DeepSeek's language switching is not by itself cause for alarm. Instead, what worries researchers is the new innovation that caused it. The DeepSeek paper describes a novel training method whereby the model was rewarded purely for getting correct answers, regardless of how comprehensible its thinking process was to humans. The worry is that this incentive-based approach could eventually lead AI systems to develop completely inscrutable ways of reasoning, maybe even creating their own non-human languages, if doing so proves to be more effective.

Were the AI industry to proceed in that direction, seeking more powerful systems by giving up on legibility, "it would take away what was looking like it could have been an easy win" for AI safety, says Sam Bowman, who leads a research department at the AI company Anthropic that is focused on aligning AI to human preferences. "We would be forfeiting an ability that we might otherwise have had to keep an eye on them."

Thinking without words

An AI creating its own alien language is not as outlandish as it may sound. Last December, Meta researchers set out to test the hypothesis that human language wasn't the optimal format for carrying out reasoning, and that large language models (or LLMs, the AI systems that underpin OpenAI's ChatGPT and DeepSeek's R1) might be able to reason more efficiently and accurately if they were unhobbled by that linguistic constraint.

The Meta researchers went on to design a model that, instead of carrying out its reasoning in words, did so using a series of numbers that represented the most recent patterns inside its neural network, essentially its internal reasoning engine. This model, they discovered, began to generate what they called "continuous thoughts": essentially, numbers encoding multiple potential reasoning paths simultaneously. The numbers were completely opaque and inscrutable to human eyes. But this strategy, they found, created emergent advanced reasoning patterns in the model.
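To make the idea concrete, here is a rough, illustrative sketch in Python of how such latent reasoning might look in code. It is not the Meta team's actual implementation: the model object, its Hugging Face-style interface, and the number of latent steps are all assumptions made for the example. The key point is that each reasoning step produces a raw vector that is fed straight back into the model, with no human-readable word ever generated along the way.

```python
# Illustrative sketch only: a generic "latent reasoning" loop in the spirit of
# the continuous-thoughts idea described above. The model interface (a Hugging
# Face-style transformer accepting inputs_embeds) and the step count are
# assumptions for this example, not the actual research code.
import torch

def reason_in_latent_space(model, input_embeds, num_latent_steps=4):
    """Reason by feeding the model's own hidden state back in as the next
    input, instead of decoding a human-readable word at every step."""
    embeds = input_embeds  # shape: (batch, sequence_length, hidden_size)
    for _ in range(num_latent_steps):
        # Run the network and take the hidden state at the final position.
        hidden = model(inputs_embeds=embeds).last_hidden_state
        continuous_thought = hidden[:, -1:, :]  # an opaque vector, not a word
        # Append the raw vector as the next "token"; nothing legible is produced.
        embeds = torch.cat([embeds, continuous_thought], dim=1)
    # Only after these latent steps would the system decode an answer in language.
    return embeds
```

Because each "continuous thought" is just a string of numbers, there is nothing for a human reviewer to read in between the question and the answer, which is precisely the property that safety researchers say they would lose.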
Those emergent patterns led to higher scores on some logical reasoning tasks, compared to models that reasoned using human language. Though the Meta research project was very different from DeepSeek's, its findings dovetailed with the Chinese research in one crucial way.

Both DeepSeek and Meta showed that human legibility imposes a tax on the performance of AI systems, according to Jeremie Harris, the CEO of Gladstone AI, a firm that advises the U.S. government on AI safety challenges. "In the limit, there's no reason that [an AI's thought process] should look human legible at all," Harris says.

And this possibility has some safety experts concerned.

"It seems like the writing is on the wall that there is this other avenue available [for AI research], where you just optimize for the best reasoning you can get," says Bowman, the Anthropic safety team leader. "I expect people will scale this work up. And the risk is, we wind up with models where we're not able to say with confidence that we know what they're trying to do, what their values are, or how they would make hard decisions when we set them up as agents."

For their part, the Meta researchers argued that their research need not result in humans being relegated to the sidelines. "It would be ideal for LLMs to have the freedom to reason without any language constraints, and then translate their findings into language only when necessary," they wrote in their paper. (Meta did not respond to a request for comment on the suggestion that the research could lead in a dangerous direction.)

The limits of language

Of course, even human-legible AI reasoning isn't without its problems. When AI systems explain their thinking in plain English, it might look like they're faithfully showing their work. But some experts aren't sure whether these explanations actually reveal how the AI really makes decisions. It could be like asking a politician for the motivations behind a policy: they might come up with an explanation that sounds good, but has little connection to the real decision-making process.

While having AI explain itself in human terms isn't perfect, many researchers think it's better than the alternative: letting AI develop its own mysterious internal language that we can't understand. Scientists are working on other ways to peek inside AI systems, similar to how doctors use brain scans to study human thinking. But these methods are still new, and haven't yet given us reliable ways to make AI systems safer.

So, many researchers remain skeptical of efforts to encourage AI to reason in ways other than human language.

"If we don't pursue this path, I think we'll be in a much better position for safety," Bowman says. "If we do, we will have taken away what, right now, seems like our best point of leverage on some very scary open problems in alignment that we have not yet solved."