TECHCRUNCH.COM
Elon Musk agrees that we've exhausted AI training data
Elon Musk concurs with other AI experts that there's little real-world data left to train AI models on.

"We've now exhausted basically the cumulative sum of human knowledge ... in AI training," Musk said during a livestreamed conversation with Stagwell chairman Mark Penn streamed on X late Wednesday. "That happened basically last year."

Musk, who owns AI company xAI, echoed themes former OpenAI chief scientist Ilya Sutskever touched on at NeurIPS, the machine learning conference, during an address in December. Sutskever, who said the AI industry had reached what he called "peak data," predicted a lack of training data will force a shift away from the way models are developed today.

Indeed, Musk suggested that synthetic data (data generated by AI models themselves) is the path forward. "The only way to supplement [real-world data] is with synthetic data, where the AI creates [training data]," he said. "With synthetic data [AI] will sort of grade itself and go through this process of self-learning."

Other companies, including tech giants like Microsoft, Meta, OpenAI, and Anthropic, are already using synthetic data to train flagship AI models. Gartner estimates 60% of the data used for AI and analytics projects in 2024 were synthetically generated.

Microsoft's Phi-4, which was open-sourced early Wednesday, was trained on synthetic data alongside real-world data. So were Google's Gemma models. Anthropic used some synthetic data to develop one of its most performant systems, Claude 3.5 Sonnet. And Meta fine-tuned its most recent Llama series of models using AI-generated data.

Training on synthetic data has other advantages, like cost savings. AI startup Writer claims its Palmyra X 004 model, which was developed using almost entirely synthetic sources, cost just $700,000 to develop compared to estimates of $4.6 million for a comparably sized OpenAI model.

But there are disadvantages as well. Some research suggests that synthetic data can lead to model collapse, where a model becomes less creative and more biased in its outputs, eventually seriously compromising its functionality. Because models create synthetic data, if the data used to train these models has biases and limitations, their outputs will be similarly tainted.