OpenAI Furious DeepSeek Might Have Stolen All the Data OpenAI Stole From Us
www.404media.co
The narrative that OpenAI, Microsoft, and freshly minted White House AI czar David Sacks are now pushing to explain why DeepSeek was able to create a large language model that outpaces OpenAI's while spending orders of magnitude less money and using older chips is that DeepSeek used OpenAI's data unfairly and without compensation. Sound familiar?

Both Bloomberg and the Financial Times are reporting that Microsoft and OpenAI have been probing whether DeepSeek improperly trained the R1 model that is taking the AI world by storm on the outputs of OpenAI models.

Here is how the Bloomberg article begins: "Microsoft Corp. and OpenAI are investigating whether data output from OpenAI's technology was obtained in an unauthorized manner by a group linked to Chinese artificial intelligence startup DeepSeek, according to people familiar with the matter." The story goes on to say that "Such activity could violate OpenAI's terms of service or could indicate the group acted to remove OpenAI's restrictions on how much data they could obtain, the people said."

The venture capitalist and new Trump administration member David Sacks, meanwhile, said that there is "substantial evidence" that DeepSeek "distilled the knowledge out of" OpenAI's models.

"There's a technique in AI called distillation, which you're going to hear a lot about, and it's when one model learns from another model, effectively what happens is that the student model asks the parent model a lot of questions, just like a human would learn, but AIs can do this asking millions of questions, and they can essentially mimic the reasoning process they learn from the parent model and they can kind of suck the knowledge of the parent model," Sacks told Fox News. "There's substantial evidence that what DeepSeek did here is they distilled the knowledge out of OpenAI's models and I don't think OpenAI is very happy about this."

I will explain what this means in a moment, but first: Hahahahahahahahahahahahahahahaha hahahhahahahahahahahahahahaha. It is, as many have already pointed out, incredibly ironic that OpenAI, a company that has been obtaining large amounts of data from all of humankind largely in an "unauthorized manner," and, in some cases, in violation of the terms of service of those it has been taking from, is now complaining about the very practices by which it has built its company.

The argument made by OpenAI, and by every artificial intelligence company that has been sued for surreptitiously and indiscriminately sucking up whatever data it can find on the internet, is not that they are not sucking up all of this data; it is that they are sucking up this data and are allowed to do so.

OpenAI is currently being sued by the New York Times for training on its articles, and its argument is that this is perfectly fine under copyright law's fair use protections. "Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness," OpenAI wrote in a blog post. In its motion to dismiss in court, OpenAI wrote that "it has long been clear that the non-consumptive use of copyrighted material (like large language model training) is protected by fair use."

OpenAI and Microsoft are essentially now whining about being beaten at their own game by DeepSeek.
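To make concrete what Sacks is describing, here is a minimal sketch of classic knowledge distillation in the style of Hinton's formulation, written in Python with PyTorch. The function name, the temperature and weighting values, and the commented training loop are illustrative assumptions of mine, not anything taken from DeepSeek's or OpenAI's actual code; the point is only that the core technique is a few lines of standard loss arithmetic, training a smaller "student" model to match a larger "teacher" model's softened output distribution.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style knowledge distillation loss (a generic sketch).

    Combines a KL-divergence term that pushes the student's softened
    output distribution toward the teacher's with an ordinary
    cross-entropy term on the ground-truth labels.
    """
    # Soften both distributions with temperature T; scaling by T*T keeps
    # gradient magnitudes comparable across temperatures (as in the paper).
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd_term = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)

    # Standard supervised loss on the hard labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1 - alpha) * ce_term


# Hypothetical usage: `teacher` is a large frozen model, `student` is a
# smaller model being trained to imitate it on the same inputs.
# for inputs, labels in dataloader:
#     with torch.no_grad():
#         teacher_logits = teacher(inputs)
#     student_logits = student(inputs)
#     loss = distillation_loss(student_logits, teacher_logits, labels)
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
```

What Sacks describes DeepSeek allegedly doing is a looser variant of this: when the "teacher" is another company's model sitting behind an API, you only get its generated text back, not its logits, so the student is simply fine-tuned on those outputs rather than on softened probability distributions.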
But additionally, part of OpenAI's argument in the New York Times case is that the only way to make a generalist large language model that performs well is by sucking up gigantic amounts of data. It tells the court that it needs a huge amount of data to make a generalist language model, meaning any one source of data is not that important. This is funny, because DeepSeek managed to make a large language model that rivals and outpaces OpenAI's own without falling into the "more data = better model" trap. Instead, DeepSeek used a reinforcement learning strategy that its paper claims is far more efficient than the approaches we've seen other AI companies use.

OpenAI's motion to dismiss the New York Times lawsuit states as part of its argument that the key to generalist language models is scale, meaning that part of its argument is that no individual piece of stolen content can make a large language model on its own, and that what allows OpenAI to make industry-leading large language models is this idea of scale. OpenAI's lawyers quote from a New York Times article about this strategy as part of their argument: "The amount of data needed was staggering" to create GPT-3, it wrote. "It was that unprecedented scale that allowed the model to internalize not only a map of human language, but achieve a level of adaptability, and emergent intelligence, that no one thought possible."

As Sacks mentioned, distillation is an established principle in artificial intelligence research, and it's something that is done all the time to refine and improve the accuracy of smaller large language models. This process is so normalized in deep learning that the most often-cited paper about it was coauthored by Geoffrey Hinton, part of a body of work that just earned him the Nobel Prize. Hinton's paper specifically suggests that distillation is a way to make models more efficient, and that "distilling works very well for transferring knowledge from an ensemble or from a large highly regularized model into a smaller, distilled model."

An IBM article on distillation notes "The LLMs with the highest capabilities are, in most cases, too costly and computationally demanding to be accessible to many would-be users like hobbyists, startups or research institutions … knowledge distillation has emerged as an important means of transferring the advanced capabilities of large, often proprietary models to smaller, often open-source models. As such, it has become an important tool in the democratization of generative AI."

In late December, OpenAI CEO Sam Altman took what many people saw as a veiled shot at DeepSeek, immediately after the release of DeepSeek V3, an earlier DeepSeek model. "It is (relatively) easy to copy something that you know works," Altman tweeted. "It is extremely hard to do something new, risky, and difficult when you don't know if it will work."

"It's also extremely hard to rally a big talented research team to charge a new hill in the fog together," he added. "This is the key to driving progress forward."

Even this is ridiculous, though. Besides being trained on huge amounts of other people's data, OpenAI's work builds on research pioneered by Google, which itself builds on earlier academic research.
This is, simply, how artificial intelligence research (and scientific research more broadly) works.

This is all to say that, if OpenAI argues that it is legal for the company to train on whatever it wants for whatever reason it wants, then it stands to reason that it doesn't have much of a leg to stand on when competitors use common strategies from the world of machine learning to make their own models. But of course, it is going with the argument that it must "protect [its] IP."

"We know PRC based companies and others are constantly trying to distill the models of leading US AI companies," an OpenAI spokesperson told Bloomberg. "As the leading builder of AI, we engage in countermeasures to protect our IP, including a careful process for which frontier capabilities to include in released models, and believe as we go forward that it is critically important that we are working closely with the US government to best protect the most capable models from efforts by adversaries and competitors to take US technology."

Jason is a cofounder of 404 Media. He was previously the editor-in-chief of Motherboard. He loves the Freedom of Information Act and surfing.