Tesla CEO Elon Musk has said that the supply of human-generated data available for AI training, including books and other resources, was exhausted last year. The claim echoes earlier statements from other researchers in the field. Musk shared these thoughts during a livestream conversation with Stagwell chairman Mark Penn, broadcast on X.
Former OpenAI chief scientist Ilya Sutskever had previously indicated in December that the AI industry had reached what he termed “peak data,” suggesting that the scarcity of training data would necessitate a shift in how models are developed.
Musk emphasized that the next frontier for AI training lies in synthetic data—data generated by AI itself.
“AI is advancing on both hardware and software fronts, and now it’s moving to synthetic data because we’ve run out of all human data.”
“We’ve literally exhausted the entire internet, all books ever written, and all interesting videos. We’ve now depleted the cumulative sum of human knowledge in AI training, which happened last year. The only way to supplement this is with synthetic data created by AI.”
He elaborated on this process, explaining that AI can generate content such as essays or theses and then evaluate its work through a self-learning mechanism. However, Musk acknowledged the challenges associated with using synthetic data, particularly in verifying the accuracy of its outputs.
“This is always challenging because how do you know if an answer is hallucinated or real? It’s difficult to establish ground truth,” he noted.
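The generate-then-verify loop Musk describes can be sketched in a few lines. This is a hypothetical toy, not any lab's actual pipeline: a "generator" proposes training examples and a "verifier" filters out hallucinated ones. Arithmetic is used here precisely because it has a checkable ground truth; for essays or theses no such oracle exists, which is the verification problem Musk points to.

```python
import random

def generate_example(rng):
    """Generator: propose a question and a (possibly wrong) answer."""
    a, b = rng.randint(0, 99), rng.randint(0, 99)
    answer = a + b
    if rng.random() < 0.2:              # simulate an occasional hallucination
        answer += rng.randint(1, 9)
    return {"question": f"{a} + {b}", "answer": answer}

def verify(example):
    """Verifier: recompute the answer -- only possible with a ground truth."""
    a, b = (int(x) for x in example["question"].split(" + "))
    return example["answer"] == a + b

def build_synthetic_dataset(n, seed=0):
    """Keep only the generated examples that pass verification."""
    rng = random.Random(seed)
    candidates = [generate_example(rng) for _ in range(n)]
    return [ex for ex in candidates if verify(ex)]

dataset = build_synthetic_dataset(1000)
print(f"kept {len(dataset)} of 1000 candidates")
```

With a 20% simulated hallucination rate, roughly four in five candidates survive the filter. The hard part in practice is that for open-ended text the `verify` step has no cheap ground truth to recompute.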
Researchers have also warned about the risks of relying heavily on synthetic data, most notably model collapse, in which models trained on their own outputs gradually lose diversity and amplify their biases, degrading over successive generations.
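The diversity-loss aspect of model collapse can be illustrated with a toy simulation (a deliberately simplified stand-in, not how any production model is trained): each generation's "model" is just the empirical distribution of the previous generation's samples. A token that is unlucky in one round disappears forever, so the vocabulary can only shrink over generations.

```python
import random

def next_generation(samples, n, rng):
    """Resample n tokens from the empirical distribution of `samples`."""
    return rng.choices(samples, k=n)

rng = random.Random(1)
data = list(range(50)) * 4          # generation 0: 50 distinct "tokens"
vocab_sizes = [len(set(data))]
for gen in range(20):
    data = next_generation(data, len(data), rng)
    vocab_sizes.append(len(set(data)))

print(vocab_sizes[0], "->", vocab_sizes[-1])  # distinct tokens only decrease
```

Because sampling can never reintroduce a token once its count hits zero, the sequence of vocabulary sizes is monotonically non-increasing: the "creative" tail of the distribution is exactly what gets lost first.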
Major tech companies such as Microsoft, Meta, OpenAI, and Anthropic are already leveraging synthetic data to train their flagship AI models.
According to Gartner, approximately 60% of the data used for AI and analytics projects in 2024 would be synthetically generated.
For instance, Microsoft’s Phi-4 model was trained on both synthetic and real-world data, and Google’s Gemma models likewise incorporated synthetic data. Anthropic used some synthetic data in developing Claude 3.5 Sonnet, while Meta fine-tuned its latest Llama models on AI-generated datasets.
As the industry navigates this new landscape of synthetic data, it faces both exciting opportunities and significant challenges that will shape the future of AI development.