AI: Training on AI-generated data risks model collapse

AI models can choke on their own output: if they are fed their own AI-generated data, they can break down entirely. Researchers at Rice University in Houston, Texas, played out this scenario in the study “Self-Consuming Generative Models Go MAD” (PDF), using generative image models to make the problem visible in an easily understandable way. The study focuses on generative image models such as the popular DALL·E 3, Midjourney, and Stable Diffusion. The researchers demonstrated that the generated images deteriorate after only a few iterations of the underlying model when AI-generated images are used to train new generations of AI.

Richard Baraniuk, professor of electrical and computer engineering at Rice University, explains: “The problem arises when this training with synthetic data is repeated over and over again and this creates a kind of feedback loop. This is what we call an autophagic or self-consuming loop.” His research group is working on such feedback loops. Baraniuk: “The bad news is that after a few generations of such training, new models can be irreparably damaged. This has been called model collapse by some, for example by colleagues in the context of large language models (LLMs). However, we think the term ‘model autophagy disorder’ (MAD), based on mad cow disease, is more appropriate.”

Mad cow disease is a fatal neurodegenerative disease in cattle; its human counterpart is contracted by eating infected meat. The disease gained widespread attention in the 1980s, when it emerged that cows were being fed the processed remains of their slaughtered counterparts – hence the term “autophagy”, from the Greek “auto” (“self”) and “phagy” (“to eat”).

The hunger for data to train new artificial intelligence (AI) models such as OpenAI’s GPT-4 or Stability AI’s Stable Diffusion is enormous. It is foreseeable that AI models will gradually absorb large amounts of text, images and other content into their training that are not human-made. Following the image of mad cow disease, they are thus fed their own data.


Researchers at Rice University examined three scenarios of such self-consuming training loops, aiming to provide a realistic representation of how real and synthetic data are combined into training datasets for generative models.

In a fully synthetic loop, successive generations of a generative model were fed a synthetic data diet drawn entirely from the outputs of previous generations. In a synthetic augmentation loop, by contrast, the training dataset for each generation consisted of synthetic data from previous generations combined with a fixed set of real training data. In the third scenario, the fresh data loop, each new model was trained on a mix of synthetic data from previous generations and a freshly drawn set of real training data.
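
For readers who want to see the structure of these loops at a glance, here is a minimal Python sketch of how the three data diets differ. It uses toy stand-ins (lists of numbers as “images”, a simple averaging step as “training”); all function names and parameters are our own illustrative assumptions, not code from the study.

import random

# Toy stand-ins, not the study's code: an "image" is a short list of numbers,
# "training" computes the mean image, "generating" adds noise around that mean.
REAL_DATA = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(100)]

def train_model(dataset):
    # Toy "model": the per-pixel mean of its training set.
    dim = len(dataset[0])
    return [sum(img[i] for img in dataset) / len(dataset) for i in range(dim)]

def generate(model, n):
    # Toy "sampling": noisy copies of the model's mean image.
    return [[v + random.gauss(0.0, 0.1) for v in model] for _ in range(n)]

def self_consuming_loop(mode, generations=5, n_synth=100, n_fresh=50):
    model = train_model(REAL_DATA)                  # generation 0: real data only
    fixed_real = random.sample(REAL_DATA, n_fresh)  # reused every round in the augmentation loop
    for _ in range(generations):
        synthetic = generate(model, n_synth)        # output of the previous generation
        if mode == "fully_synthetic":
            train_set = synthetic                   # nothing but AI output
        elif mode == "synthetic_augmentation":
            train_set = synthetic + fixed_real      # AI output plus a fixed real dataset
        else:  # "fresh_data"
            fresh = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(n_fresh)]
            train_set = synthetic + fresh           # AI output plus newly drawn real data
        model = train_model(train_set)              # the next generation learns from this mix
    return model

print(self_consuming_loop("fully_synthetic"))

In the fully synthetic case there is no corrective real data at all – the regime in which the study observes the strongest degradation.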



AI portraits become more and more alike by the fifth generation of training on AI-generated data

(Image: study “Self-Consuming Generative Models Go MAD”, source: https://arxiv.org/abs/2307.01850)

Progressive iterations of the loop showed that the models generate increasingly distorted images over time; the less fresh data they receive for training, the more distorted the images become. Comparing successive generations of image datasets reveals the progressive impoverishment: faces become increasingly covered with grid-like marks – what the authors call “generative artifacts” – or stop looking like the same person. Datasets of handwritten digits degrade into unintelligible scribbles.

Human behavior itself exacerbates the problem. Photographs of plants, for example, predominantly show flowers, people in photos smile more often than they do in everyday life, and pictures of mountain holidays usually show sun and snow. An AI trained on such data can conclude that most plants are flowers – which is not the case – that people smile most of the time – which they do not – and that the sky in the mountains is always blue. After a few model generations, AI generators are no longer able to depict wheat stalks, crying children or a rain shower on a mountain hike.

Just as the gene pool shrinks when animal and plant species go extinct, the range of what AI generators can produce on their own is shrinking too.

AI developers no longer face only the question of which data they are allowed to use. According to the study, the convenient route of training on AI-generated data looks like business-model suicide in installments. In their own interest, developers should avoid training future models on AI data so that their generators remain usable in the long run. That would require companies to agree on common standards, but this does not seem to be happening. At the very least, clear labeling of AI-generated content on the Internet is needed – not only for consumers, but also for the developers themselves.

The data available for training is already so scarce that AI-generated content has long been used for the purpose – and the risk of contamination with this “data mad cow disease” is growing. If AI content were consistently labeled, companies could exclude it from training to protect their new generator models. They would then have to satisfy their hunger for data differently: relying exclusively on human-made content and becoming better at recognizing material produced with AI. Against this background, the question of remuneration for the use of such data in training arises anew: human-generated content will evidently remain valuable.


(Mill)

