The Internet May Not Be Enough Data for AI

Artificial intelligence (AI) companies are facing a growing problem: the internet is not big enough to provide all the data they need to train their models.

Data is essential for AI development. In general, the more high-quality data a model is trained on, the more capable it tends to become. However, natural (human-generated) data is a finite resource, and the supply may one day run out.

The think tank Epoch AI has predicted that AI companies will likely face a shortage of high-quality text training data by 2026.

According to the Wall Street Journal, some companies are looking for alternative sources of data to train their models, as the growth of content on the internet is not keeping pace with demand. They are considering options such as video subtitles and even AI-generated data.

OpenAI has reportedly discussed training its GPT-5 model on transcriptions of public YouTube videos. Mira Murati, OpenAI’s CTO, recently declined to answer when asked whether YouTube content was used to train the company’s Sora AI model.

The use of synthetic data to train AI models has been much debated in recent months. Some researchers have found that training models on data generated by other AI systems can lead to model collapse, in which output quality degrades over successive generations, or to otherwise flawed results.

Some companies, such as OpenAI and Anthropic, the creator of the AI model Claude, are working to create higher-quality synthetic data so that models are not trained on low-quality output. However, neither company has disclosed many details about its efforts.

Anthropic said at the launch of Claude 3 that the model was trained in part on data the company had generated internally. Jared Kaplan, Anthropic’s chief scientist, also told the Wall Street Journal that he believes there are many potential uses for synthetic data.

“In the next five years, applications and devices will become less artificial and more intelligent,” according to Harvard Business Review. “They will rely less on learning from big data and more on reasoning from the whole to the part, which is similar to how humans solve problems and perform tasks. The power of reasoning could open up a wider range of applications for AI.”
