The promise and perils of synthetic data

shared a link

2024-12-24 18:27:55 -

Is it possible for an AI to be trained just on data generated by another AI? It might sound like a harebrained idea. But its one thats been around for quite some time and as new, real data is increasingly hard to come by, its been gaining traction.Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is said to be sourcing synthetic training data from o1, its reasoning model, for the upcoming Orion.But why does AI need data in the first place and what kind of data does it need? And can this data really be replaced by synthetic data?The importance of annotationsAI systems are statistical machines. Trained on a lot of examples, they learn the patterns in those examples to make predictions, like that to whom in an email typically precedes it may concern.Annotations, usually text labeling the meaning or parts of the data these systems ingest, are a key piece in these examples. They serve as guideposts, teaching a model to distinguish among things, places, and ideas.Consider a photo-classifying model shown lots of pictures of kitchens labeled with the word kitchen. As it trains, the model will begin to make associations between kitchen and general characteristics of kitchens (e.g. that they contain fridges and countertops). After training, given a photo of a kitchen that wasnt included in the initial examples, the model should be able to identify it as such. (Of course, if the pictures of kitchens were labeled cow, it would identify them as cows, which emphasizes the importance of good annotation.)The appetite for AI and the need to provide labeled data for its development have ballooned the market for annotation services. Dimension Market Research estimates that its worth $838.2 million today and will be worth $10.34 billion in the next 10 years. While there arent precise estimates of how many people engage in labeling work, a 2022paperpegs the number in the millions.Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. Some of these jobs pay reasonably well, particularly if the labeling requires specialized knowledge (e.g. math expertise). Others can be backbreaking. Annotators in developing countries are paid only a few dollars per hour on average, without any benefits or guarantees of future gigs.A drying data wellSo theres humanistic reasons to seek out alternatives to human-generated labels. For example, Uber is expanding its fleet of gig workers to work on AI annotation and data labeling. But there are also practical ones.Humans can only label so fast. Annotators also have biases that can manifest in their annotations, and, subsequently, any models trained on them. Annotators make mistakes, or get tripped up by labeling instructions. And paying humans to do things is expensive.Data in general is expensive, for that matter. Shutterstock is charging AI vendors tens of millions of dollars to access its archives, while Reddithas made hundreds of millions from licensing data to Google, OpenAI, and others.Lastly, data is also becoming harder to acquire.Most models are trained on massive collections of public data data that owners are increasingly choosing to gate over fears it will beplagiarized or that they wont receive credit or attribution for it. More than 35% of the worlds top 1,000 websitesnow block OpenAIs web scraper. And around 25% of data from high-quality sources has been restricted from the major datasets used to train models, one recentstudyfound.Should the current access-blocking trend continue, the research group Epoch AIprojectsthat developers will run out of data to train generative AI models between 2026 and 2032. That, combined with fears ofcopyright lawsuitsand objectionable material making their way into open datasets, has forced a reckoning for AI vendors.Synthetic alternativesAt first glance, synthetic data would appear to be the solution to all these problems. Need annotations? Generate em. More example data? No problem. The skys the limit.And to a certain extent, this is true.If data is the new oil,syntheticdata pitches itself as biofuel, creatable without the negative externalities of the real thing, Os Keyes, a PhD candidate at the University of Washington who studies the ethical impact of emerging technologies, told TechCrunch. You can take a small starting set of data and simulate and extrapolate new entries from it.The AI industry has taken the concept and run with it. This month, Writer, an enterprise-focused generative AI company, debuted a model, Palmyra X 004, trained almost entirely on synthetic data. Developing it cost just $700,000, Writer claims compared to estimates of $4.6 million for a comparably-sized OpenAI model.Microsofts Phi open models were trained using synthetic data, in part. So were Googles Gemma models. Nvidia this summer unveiled a model family designed to generate synthetic training data, and AI startup Hugging Face recently released what it claims is the largest AI training dataset of synthetic text.Synthetic data generation has become a business in its own right one that could be worth $2.34 billion by 2030. Gartnerpredictsthat 60% of the data used for AI and analytics projects this year will be synthetically generated.Luca Soldaini, a senior research scientist at the Allen Institute for AI, noted that synthetic data techniques can be used to generate training data in a format thats not easily obtained through scraping (or even content licensing). For example, in training its video generator Movie Gen, Meta used Llama 3 to create captions for footage in the training data, which humans then refined to add more detail, like descriptions of the lighting.Along these same lines, OpenAI says that it fine-tunedGPT-4o using synthetic data to build the sketchpad-like Canvas feature for ChatGPT. And Amazon has said that it generates synthetic data to supplement the real-world data it uses to train speech recognition models for Alexa.Synthetic data models can be used to quickly expand upon human intuition of whichdatais needed to achieve a specific model behavior, Soldaini said. Synthetic risksSynthetic data is no panacea, however. It suffers from the same garbage in, garbage out problem as all AI. Models create synthetic data, and if the data used to train these models has biases and limitations, their outputs will be similarly tainted. For instance, groups poorly represented in the base data will be so in the synthetic data.The problem is, you can only do so much, Keyes said. Say you only have 30 Black people in a dataset. Extrapolating out might help, but if those 30 people are all middle-class, or all light-skinned, thats what the representative data will all look like.To this point, a 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models whose quality or diversity progressively decrease. Sampling bias poor representation of the real world causes a models diversity to worsen after a few generations of training, according to the researchers (although they also found that mixing in a bit of real-world data helps to mitigate this).Keyes sees additional risks in complex models such as OpenAIs o1, which he thinks could produce harder-to-spot hallucinations in their synthetic data. These, in turn, could reduce the accuracy of models trained on the data especially if the hallucinations sources arent easy to identify.Complex models hallucinate; data produced by complex models contain hallucinations, Keyes added. And with a model like o1, the developers themselves cant necessarily explain why artefacts appear.Compounding hallucinations can lead to gibberish-spewing models. A study published in the journal Nature reveals how models, trained on error-ridden data, generate even more error-ridden data, and how this feedback loop degrades future generations of models. Models lose their grasp of more esoteric knowledge over generations, the researchers found becoming more generic and often producing answers irrelevant to the questions theyre asked.Image Credits:Ilia Shumailov et al.A follow-up study shows that other types of models, like image generators, arent immune to this sort of collapse:Image Credits:Ilia Shumailov et al.Soldaini agrees that raw synthetic data isnt to be trusted, at least if the goal is to avoid training forgetful chatbots and homogenous image generators. Using it safely, he says, requires thoroughly reviewing, curating, and filtering it, and ideally pairing it with fresh, real data just like youd do with any other dataset.Failing to do so could eventuallylead to model collapse, where a model becomes less creative and more biased in its outputs, eventually seriously compromising its functionality. Though this process could be identified and arrested before it gets serious, it is a risk.Researchers need to examine the generated data, iterate on the generation process, and identify safeguards to remove low-quality data points, Soldaini said. Syntheticdatapipelines are not a self-improving machine; their output must be carefully inspected and improved before being used for training.OpenAI CEO Sam Altman once argued that AI will someday produce synthetic datagood enough to effectively train itself. But assuming thats even feasible the tech doesnt exist yet. No major AI lab has released a model trainedAt least for the foreseeable future, it seems well need humans in the loop somewhere to make sure a models training doesnt go awry.TechCrunch has an AI-focused newsletter!Sign up hereto get it in your inbox every Wednesday.Update: This story was originally published on October 23 and was updated December 24 with more information.

0 Comments 0 Shares 161 Views