Researchers Warn of Impending Data Shortage for AI Training by 2026

Artificial intelligence (AI) is currently at the peak of its popularity, but researchers have issued a warning: the industry may soon face a shortage of training data, which is essential for powering AI systems. This could have a significant impact on the growth of AI models, particularly large language models, and potentially change the course of the AI revolution.

But why is a lack of data a concern when there is so much available on the web? And is there a solution to mitigate this risk?

Why High-Quality Data Matters for AI

To train powerful, accurate, and high-quality AI algorithms, a substantial amount of data is required. For example, ChatGPT was trained on a massive 570 gigabytes of text data, equivalent to approximately 300 billion words.

Similarly, the Stable Diffusion algorithm, which powers popular AI image-generating apps like DALL-E, Lensa, and Midjourney, was trained on the LAION-5B dataset, which consists of 5.8 billion image-text pairs. If an algorithm is trained on insufficient data, it will produce inaccurate or low-quality outputs.

The quality of the training data is also crucial. Low-quality data, such as social media posts or blurry photographs, may be easily accessible but are inadequate for training high-performing AI models.

Text sourced from social media platforms can be biased, prejudiced, or contain disinformation and illegal content, which the AI model may replicate. For instance, when Microsoft attempted to train its AI bot using Twitter content, the bot ended up generating racist and misogynistic outputs.

This is why AI developers seek out high-quality content from books, online articles, scientific papers, Wikipedia, and curated web sources. For example, the Google Assistant was trained on 11,000 romance novels obtained from the self-publishing site Smashwords to enhance its conversational abilities.

Do We Have Enough Data?

The AI industry has been training AI systems on ever-larger datasets, resulting in high-performing models like ChatGPT and DALL-E 3. However, research indicates that online data stocks are growing much more slowly than the datasets used for AI training.

In a paper published last year, a group of researchers predicted that if current AI training trends continue, we will run out of high-quality text data before 2026. They also estimated that low-quality language data will be depleted between 2030 and 2050, and low-quality image data between 2030 and 2060.
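The logic behind a projection like this can be illustrated with a small sketch. It assumes, purely for illustration, that both the stock of available data and the amount of data used for training grow at fixed yearly rates, and it looks for the first year in which training demand catches up with the stock. The function name, the growth rates, and the starting quantities are all hypothetical; they are not taken from the paper, whose own method may differ.

```python
def exhaustion_year(start_year, stock, stock_growth, demand, demand_growth, horizon=100):
    """Return the first year in which training demand meets or exceeds the data stock.

    All inputs are illustrative assumptions. Growth rates are yearly fractions
    (0.07 means 7% a year). Returns None if demand never catches up within
    `horizon` years.
    """
    for offset in range(horizon + 1):
        if demand >= stock:
            return start_year + offset
        # Compound both quantities forward by one year.
        stock *= 1 + stock_growth
        demand *= 1 + demand_growth
    return None


# Hypothetical numbers chosen only to show the mechanics, not real estimates:
# a stock of 1,000 units growing 7% a year, and training demand of 100 units
# growing 50% a year.
print(exhaustion_year(2022, stock=1000.0, stock_growth=0.07,
                      demand=100.0, demand_growth=0.5))
```

The point of the sketch is that when demand grows much faster than supply, even a tenfold head start in the stock is used up within a few years, which is why the predicted dates are so close.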

According to accounting and consulting group PwC, AI has the potential to contribute up to US$15.7 trillion (A$24.1 trillion) to the world economy by 2030. However, the scarcity of usable data could impede its development.

Should We Be Concerned?

While these points may raise concerns among AI enthusiasts, the situation may not be as dire as it appears. There are several unknowns regarding the future development of AI models, as well as potential solutions to address the risk of data shortages.

One opportunity is for AI developers to enhance algorithms so that they can utilize existing data more efficiently.

In the coming years, it is likely that high-performing AI systems will be trained using less data and potentially less computational power. This would not only address the data shortage risk but also reduce AI’s carbon footprint.

Another option is the use of AI to generate synthetic data for training systems. In other words, developers can create the data they need specifically tailored to their AI models.

Several projects are already leveraging synthetic content, often obtained from data-generating services like Mostly AI. This approach is expected to become more prevalent in the future.

Developers are also exploring content sources beyond the free online space, such as large publishers and offline repositories. Digitizing millions of texts published before the internet era could provide a new data source for AI projects.

News Corp, one of the world’s largest news content owners, recently announced negotiations with AI developers for content deals. Such agreements would require AI companies to pay for training data, as opposed to scraping it from the internet for free, as they have done so far.

Content creators have expressed concerns about the unauthorized use of their content to train AI models, leading to lawsuits against companies like Microsoft, OpenAI, and Stability AI. Being compensated for their work could help restore the power balance between creatives and AI companies.

Rita Matulionyte, Senior Lecturer in Law, Macquarie University

This article is republished from The Conversation under a Creative Commons license.