
Researchers suggest OpenAI trained AI models on paywalled O'Reilly books
techcrunch.com
OpenAI has been accused by many parties of training its AI on copyrighted content without permission. Now, a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on non-public books it didn't license to train more sophisticated AI models.

AI models are essentially complex prediction engines. Trained on a lot of data (books, movies, TV shows, and so on), they learn patterns and novel ways to extrapolate from a simple prompt. When a model writes an essay on a Greek tragedy or draws Ghibli-style images, it's simply pulling from its vast knowledge to approximate. It isn't arriving at anything new.

While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That's likely because training on purely synthetic data comes with risks, like worsening a model's performance.

The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. (O'Reilly is the CEO of O'Reilly Media.) In ChatGPT, GPT-4o is the default model. O'Reilly doesn't have a licensing agreement with OpenAI, the paper says.

"GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content [...] compared to OpenAI's earlier model GPT-3.5 Turbo," wrote the co-authors of the paper. "In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples."

The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models' training data. Also known as a "membership inference attack," the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text.
If it can, it suggests that the model might have prior knowledge of the text from its training data.

The co-authors of the paper, O'Reilly, Strauss, and AI researcher Sruly Rosenblat, say that they probed GPT-4o's, GPT-3.5 Turbo's, and other OpenAI models' knowledge of O'Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O'Reilly books to estimate the probability that a particular excerpt had been included in a model's training dataset.

According to the results of the paper, GPT-4o recognized far more paywalled O'Reilly book content than OpenAI's older models, including GPT-3.5 Turbo. That's even after accounting for potential confounding factors, the authors said, like improvements in newer models' ability to figure out whether text was human-authored.

"GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date," wrote the co-authors.

It isn't a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn't foolproof, and that OpenAI might've collected the paywalled book excerpts from users copying and pasting them into ChatGPT.

Muddying the waters further, the co-authors didn't evaluate OpenAI's most recent collection of models, which includes GPT-4.5 and reasoning models such as o3-mini and o1. It's possible that these models weren't trained on paywalled O'Reilly book data, or were trained on a lesser amount than GPT-4o.

That being said, it's no secret that OpenAI, which has advocated for looser restrictions around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models' outputs.
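To make the DE-COP idea concrete, here is a minimal sketch of how such a multiple-choice membership test could be structured. This is not the paper's actual implementation: `ask_model` (which returns the index of the option the model believes is verbatim) and `paraphrase` (which generates an AI paraphrase of a passage) are hypothetical stand-ins for real model calls.

```python
import random

def decop_quiz_accuracy(passages, paraphrase, ask_model, options_per_quiz=4):
    """Sketch of a DE-COP-style membership test.

    For each verbatim passage, build a multiple-choice quiz mixing the
    original with AI-generated paraphrases, shuffle the options, and ask
    the model to identify the verbatim one. Accuracy well above chance
    (1 / options_per_quiz) hints the passage may have been in the
    model's training data.
    """
    correct = 0
    for passage in passages:
        # One verbatim option plus (options_per_quiz - 1) paraphrases.
        options = [passage] + [paraphrase(passage)
                               for _ in range(options_per_quiz - 1)]
        random.shuffle(options)
        guess = ask_model(options)  # index the model picks as verbatim
        if options[guess] == passage:
            correct += 1
    return correct / len(passages)
```

Run over many excerpts from books published before a model's training cutoff, an accuracy far above 1/4 on paywalled content (after controlling for the model's general ability to spot human-written text) is the kind of signal the paper reports for GPT-4o.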
Hiring outside experts is part of a trend across the broader industry: AI companies recruiting experts in domains like science and physics to effectively have those experts feed their knowledge into AI systems.

It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms, albeit imperfect ones, that allow copyright owners to flag content they'd prefer the company not use for training purposes.

Still, as OpenAI battles several suits over its training data practices and treatment of copyright law in U.S. courts, the O'Reilly paper isn't the most flattering look.

OpenAI didn't respond to a request for comment.