I agree with OpenAI: You shouldn't use other people's work without permission
arstechnica.com
deep irony

Op-ed: OpenAI says DeepSeek used its data improperly. That must be frustrating!

Andrew Cunningham | Jan 30, 2025 12:55 pm

Credit: Benj Edwards / OpenAI

ChatGPT developer OpenAI and other players in the generative AI business were caught unawares this week by a Chinese company named DeepSeek, whose open source R1 simulated reasoning model provides results similar to OpenAI's best paid models (with some notable exceptions) despite being created using just a fraction of the computing power.

Since ChatGPT, Stable Diffusion, and other generative AI models first became publicly available in late 2022 and 2023, the US AI industry has been undergirded by the assumption that you'd need ever-greater amounts of training data and compute power to continue improving those models and get (eventually, maybe) to a functioning version of artificial general intelligence, or AGI.

Those assumptions were reflected in everything from Nvidia's stock price to energy investments and data center plans. Whether DeepSeek fundamentally upends those plans remains to be seen. But at a bare minimum, it has shaken investors who have poured money into OpenAI, a company that reportedly believes it won't turn a profit until the end of the decade.

OpenAI CEO Sam Altman concedes that the DeepSeek R1 model is "impressive," but the company is taking steps to protect its models (both language and business); OpenAI told the Financial Times and other outlets that it believed DeepSeek had used output from OpenAI's models to train the R1 model, a method known as "distillation."
Using OpenAI's models to train a model that will compete with OpenAI's models is a violation of the company's terms of service.

"We take aggressive, proactive countermeasures to protect our technology and will continue working closely with the US government to protect the most capable models being built here," an OpenAI spokesperson told Ars.

So taking data without permission is bad, now?

I'm not here to say whether the R1 model is the product of distillation. What I can say is that it's a little rich for OpenAI to suddenly be so very publicly concerned about the sanctity of proprietary data.

The company is currently involved in several high-profile copyright infringement lawsuits, including one filed by The New York Times alleging that OpenAI and its partner Microsoft infringed its copyrights and that the companies provide the Times' content to ChatGPT users "without The Times's permission or authorization." Other authors and artists have suits working their way through the legal system as well.

In its post responding to the suit, OpenAI claims that "like any single source, [New York Times] content didn't meaningfully contribute to the training of our existing models and also wouldn't be sufficiently impactful for future training," but that hasn't stopped the company from pursuing content deals with the Times and other news organizations (including Ars Technica owner Condé Nast), plus user-generated content sites like Reddit and Stack Overflow and book publishers like HarperCollins.

Collectively, the contributions from copyrighted sources are significant enough that OpenAI has said it would be "impossible" to build its large language models without them.
The implication being that copyrighted material had already been used to build these models long before these publisher deals were ever struck.

That's also strongly implied by a comment that investment firm Andreessen Horowitz filed with the US Copyright Office in late 2023 (PDF), in which the firm argued that treating AI model training as copyright infringement "would upset at least a decade's worth of investment-backed expectations." Also known as a16z, Andreessen Horowitz is an OpenAI investor, and founder Marc Andreessen is a prominent AI booster.

The filing argues, among other things, that AI model training isn't copyright infringement because it "is in service of a non-exploitive purpose: to extract information from the works and put that information to use, thereby 'expand[ing] [the works'] utility.'"

Maybe DeepSeek did distill OpenAI's models to train its own, and maybe that is a violation of the terms of service OpenAI has published. But "extracting information and putting it to use" feels like a fair description of what DeepSeek has done here. If DeepSeek's work truly weren't possible without the work that OpenAI had already done, perhaps DeepSeek should think about compensating OpenAI in some way?

This kind of hypocrisy makes it difficult for me to muster much sympathy for an AI industry that has treated the swiping of other humans' work as a completely legal and necessary sacrifice, a victimless crime that provides benefits so significant and self-evident that it wasn't even worth having a conversation about beforehand.

A last bit of irony in the Andreessen Horowitz comment: There's some handwringing about the impact of a copyright infringement ruling on competition.
Having to license copyrighted works at scale "would inure to the benefit of the largest tech companies, those with the deepest pockets and the greatest incentive to keep AI models closed off to competition."

"A multi-billion-dollar company might be able to afford to license copyrighted training data, but smaller, more agile startups will be shut out of the development race entirely," the comment continues. "The result will be far less competition, far less innovation, and very likely the loss of the United States' position as the leader in global AI development."

Some of the industry's agita about DeepSeek is probably wrapped up in the last bit of that statement: that a Chinese company has apparently beaten an American company to the punch on something. Andreessen himself referred to DeepSeek's model as a "Sputnik moment" for the AI business, implying that US companies need to catch up or risk being left behind. But regardless of geography, it feels an awful lot like OpenAI wants to benefit from unlimited access to others' work while also restricting similar access to its own work.

Good luck with that!

Andrew Cunningham, Senior Technology Reporter

Andrew is a Senior Technology Reporter at Ars Technica, with a focus on consumer tech including computer hardware and in-depth reviews of operating systems like Windows and macOS. Andrew lives in Philadelphia and co-hosts a weekly book podcast called Overdue.