What Happens When We Train AI on AI-Generated Data?

In the world of artificial intelligence (AI) and large language models (LLMs), finding appropriate training data is the core requirement for building generative solutions. As the capabilities of generative AI models like ChatGPT and DALL-E continue to grow, there is an increasing temptation to use their outputs as training data for new AI systems. However, recent research has shown that doing so can be dangerous, leading to a phenomenon called “model collapse.” In a study published in July 2023, researchers at Rice University and Stanford University concluded that training AI models exclusively on the outputs of generative AI is not a good idea. They titled their report “Self-Consuming Generative Models Go MAD.”

When an AI model is trained on data generated by other AI models, it is essentially learning from a distorted reflection of itself. Just like in a game of “telephone,” each iteration of AI-generated data becomes more corrupted and disconnected from reality. Researchers have found that even a relatively small amount of AI-generated content in the training data can be “poisonous” to the model, causing its outputs to degrade into nonsensical gibberish within just a few training cycles. This happens because the errors and biases inherent in the synthetic data are amplified as the model learns from its own generated outputs.
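To make this feedback loop concrete, below is a minimal toy sketch in Python (an illustration, not the experiment from the study): the “model” is just a Gaussian fit to its training data, and each new generation trains only on samples drawn from the previous generation’s fit. The sample size and number of generations are arbitrary assumptions; the point is simply that small estimation errors compound when a model keeps learning from its own outputs.

```python
import numpy as np

# Toy illustration of a self-consuming training loop (an assumption-laden
# sketch, not the paper's setup): the "model" fits a Gaussian to its data,
# then the next generation trains only on samples drawn from that fit.
rng = np.random.default_rng(0)
real_data = rng.normal(loc=0.0, scale=1.0, size=1_000)  # the "real world"

data = real_data
for generation in range(1, 11):
    mu, sigma = data.mean(), data.std()        # "train" on the current data
    data = rng.normal(mu, sigma, size=1_000)   # next generation sees only samples
    print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")

# Each fit is faithful to the data it saw, yet the estimated mean wanders
# and the spread drifts away from the original distribution over generations.
```

Even in this tiny example, no single generation does anything wrong; the degradation comes entirely from the loop itself.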

The problem of model collapse has been observed across different types of AI models, from language models to image generators. Larger, more powerful models may be slightly more resistant, but there is little evidence that they are immune to this issue. As AI-generated content proliferates across the internet and in standard training datasets, future AI models will likely be trained on a mixture of real and synthetic data. This forms an “autophagous” or self-consuming loop that can steadily degrade the quality and diversity of the model’s outputs over successive generations.

Researchers at Rice University and Stanford University conducted a thorough analysis of self-consuming generative image models trained on their own synthetic outputs. They identified three main types of self-consuming loops (a simplified sketch contrasting the three regimes follows the list):

  • Fully Synthetic Loops: In these loops, models are trained solely on synthetic data generated by previous models. Researchers found that fully synthetic loops inevitably lead to Model Autophagy Disorder (MAD), with either the quality (precision) or diversity (recall) of the generated images progressively decreasing over successive generations. For example, the researchers trained two identical facial-image generators in fully synthetic loops, one with and one without a “sampling” bias that boosts synthetic quality at the cost of diversity. Without the bias, the generated images developed wave-like artifacts that decreased realism (quality). With the bias, the images maintained high quality but became less and less diverse, eventually converging to just a few nearly identical faces.
  • Synthetic Augmentation Loops: These loops incorporate a fixed set of real training data along with the synthetic data. Researchers found that this can delay but not prevent the onset of MAD. The real data improves performance initially, but the synthetic data eventually dominates and leads to a decline in quality or diversity.
  • Fresh Data Loops: In these loops, each generation of the model has access to a new, previously unseen set of real training data. Researchers found that this can prevent MAD and maintain both the quality and diversity of the generated images over successive generations. The key factor is the availability of sufficient fresh real data in each generation; without it, self-consuming generative models are doomed to suffer from MAD, with their synthetic outputs progressively degrading in quality or diversity unless they receive a steady supply of new real-world training data.
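As a rough, hypothetical sketch of these three regimes, the toy Gaussian “model” from earlier can be rerun with different data-mixing rules. The loop names mirror the study; the model, sample sizes, and mixing rules here are simplifying assumptions, not the paper’s actual experimental setup.

```python
import numpy as np

# Hypothetical sketch of the three self-consuming loop types, using a toy
# Gaussian "model": fit (mean, std) to the training set, then sample
# synthetic data from the fit. Mixing rules and sizes are assumptions.
rng = np.random.default_rng(0)
N = 1_000

def next_training_set(mode, synthetic, fixed_real):
    if mode == "fully_synthetic":
        return synthetic                                # synthetic data only
    if mode == "synthetic_augmentation":
        return np.concatenate([fixed_real, synthetic])  # fixed real set + synthetic
    if mode == "fresh_data":
        fresh_real = rng.normal(0.0, 1.0, size=N)       # previously unseen real data
        return np.concatenate([fresh_real, synthetic])
    raise ValueError(mode)

for mode in ("fully_synthetic", "synthetic_augmentation", "fresh_data"):
    fixed_real = rng.normal(0.0, 1.0, size=N)
    data = fixed_real
    for _ in range(10):
        mu, sigma = data.mean(), data.std()        # "train"
        synthetic = rng.normal(mu, sigma, size=N)  # "generate"
        data = next_training_set(mode, synthetic, fixed_real)
    print(f"{mode:22s} final fit: mean={mu:+.3f}  std={sigma:.3f}")
```

This toy cannot reproduce the paper’s precision/recall measurements, but it shows how the mixing rule alone determines how strongly each generation stays anchored to real data.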

Recently, prominent figures in the AI industry made commitments at the White House to introduce strategies such as watermarking for distinguishing synthetic data from authentic data. The proposed watermarking approach would embed a technical marker within synthetic content, such as deep-fake images or audio. This watermark is intended to make it easier for users to identify when content has been artificially generated, rather than capturing real-world events. These endeavors are ultimately geared towards addressing the adverse impacts of synthetic data on the internet. In relation to Model Autophagy Disorder (MAD), watermarking could serve as a preventive measure to stop generative models from being trained on AI-generated data. Nonetheless, the effectiveness of such approaches in tackling MADness is yet to be determined and requires further investigation.
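As one hedged illustration of how such a watermark might feed into data curation, the sketch below filters a corpus through a hypothetical detector before training. `detect_watermark` is a stand-in: real watermark detection depends on the specific scheme and often on keys held by the content provider, none of which is modeled here.

```python
from typing import Callable, Iterable, List

def filter_unwatermarked(samples: Iterable[str],
                         detect_watermark: Callable[[str], bool]) -> List[str]:
    """Keep only samples that the (hypothetical) detector does not flag as AI-generated."""
    return [s for s in samples if not detect_watermark(s)]

# Usage with a dummy detector that flags samples carrying an explicit marker.
corpus = ["a real photo caption", "[AI] a synthetic caption", "another real caption"]
clean_corpus = filter_unwatermarked(corpus, lambda s: s.startswith("[AI]"))
print(clean_corpus)  # ['a real photo caption', 'another real caption']
```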

Researchers also emphasize the crucial importance of maintaining a representative balance of real and synthetic content in the training data, with minority groups properly preserved. Companies will need to carefully curate their datasets and monitor for signs of degradation. Training data should be diverse and representative of different perspectives, and special effort should be made to incorporate data sources that are typically underrepresented in the digital landscape. Otherwise, we risk a future where AI systems become increasingly divorced from reality, with outputs that are biased, unreliable, and nonsensical. This could have serious consequences across many domains, from content generation to decision-making systems. It is true that we, as humans, consume AI-generated content extensively in our daily lives, but we have coping mechanisms that AI systems likely do not.
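One way to operationalize that kind of curation, sketched below under assumed field names and thresholds, is to audit the training mix before each run: check the real-to-synthetic ratio and flag subgroups that fall below a minimum share. The thresholds are illustrative, not values taken from the research.

```python
from collections import Counter
from typing import Dict, Iterable, List

def audit_training_mix(records: Iterable[Dict[str, str]],
                       min_real_fraction: float = 0.7,
                       min_group_fraction: float = 0.05) -> List[str]:
    """Flag curation issues in a training mix.

    Each record is assumed to look like {"source": "real" | "synthetic", "group": "<subgroup>"}.
    Thresholds are illustrative defaults, not recommendations from the study.
    """
    records = list(records)
    n = len(records)
    issues = []
    real_fraction = sum(r["source"] == "real" for r in records) / n
    if real_fraction < min_real_fraction:
        issues.append(f"real-data share {real_fraction:.0%} is below {min_real_fraction:.0%}")
    group_counts = Counter(r["group"] for r in records)
    for group, count in group_counts.items():
        if count / n < min_group_fraction:
            issues.append(f"subgroup '{group}' underrepresented at {count / n:.1%}")
    return issues

# Example: a mix that is mostly synthetic and skewed toward one subgroup.
mix = [{"source": "synthetic", "group": "A"}] * 80 + [{"source": "real", "group": "B"}] * 20
print(audit_training_mix(mix))
```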

The lessons from this research echo past cautionary tales, like the spread of radioactive fallout contaminating newly produced steel. Just as we had to be vigilant about the purity of our materials, we must now be equally careful about the purity of our AI training data. Through responsible data curation and monitoring, we can hopefully steer the development of AI in a direction that remains grounded and serves the diverse needs of all communities. The alternative is a dystopian future where our AI tools become increasingly “mad,” no longer fit for purpose. 

About the Author

Ranjeeta Bhattacharya is a senior data scientist within the AI Hub wing of BNY Mellon, the world’s largest custodian bank. Her experience as a data science and technology consultant spans more than 15 years, during which she has performed multi-faceted techno-functional roles as a software developer, solution designer, technical analyst, delivery manager, and project manager for Fortune 500 IT consulting companies across the globe. She holds an undergraduate degree in computer science and engineering, a master’s degree in data science, and multiple certifications and publications in these domains, demonstrating her commitment to continuous learning and knowledge sharing.
