Using data generated by artificial intelligence (AI) to train those same systems could be a death sentence for them. A new study, published this Wednesday in the scientific journal Nature, warns that feeding machine-learning models back their own synthetic data "inevitably" contaminates their results, a poisoning known as model collapse.
The research, led by computer scientist Ilya Shumailov, who works at Google DeepMind, shows that recursively feeding a system on AI-generated data alone impairs its ability to learn, corrupts its operation, and produces incorrect information, replacing the original content with "unrelated nonsense".
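To make that feedback loop concrete, here is a minimal, hypothetical sketch, not the study's code: a one-dimensional Gaussian stands in for a full model, and each "generation" is trained only on samples drawn from the previous one. All names and parameters are illustrative.

```python
# Illustrative sketch of recursive self-training, assuming a simple
# Gaussian as a stand-in for an LLM. Hypothetical, not the study's code.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 trains on "human" data drawn from the true distribution
data = rng.normal(loc=0.0, scale=1.0, size=200)

for generation in range(1, 10):
    # "Train" the model: estimate mean and spread from the current data
    mu, sigma = data.mean(), data.std()
    # The next generation sees only data the model itself generated
    data = rng.normal(loc=mu, scale=sigma, size=200)
    print(f"generation {generation}: mu={mu:+.3f}, sigma={sigma:.3f}")

# Across generations, sigma follows a downward-drifting random walk:
# the tails of the original distribution are the first thing lost.
```

Run for enough generations, the estimated distribution narrows and drifts until it no longer resembles the data the first model was trained on, which is the qualitative behavior the study describes.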
Over the past year and a half, the tech ecosystem has experienced first-hand the rise in popularity of so-called generative AI. These systems are based on large language models (LLMs) trained on data extracted from the internet, and they can generate all kinds of content, from written messages to images and sounds. For now, most of these tools, such as ChatGPT, are trained on human-made material.
The “inevitable” collapse
However, the fever sparked by generative AI in the sector and its accelerated deployment could change that reality. “As these LLMs gain greater adoption, more synthetic data ends up on the Internet, which could hypothetically affect the training of future versions,” warns Pablo Haya Coll, a researcher at the Computer Linguistics Laboratory of the Autonomous University of Madrid (UAM) and director of the Business and Language Analytics (BLA) area of the Institute of Knowledge Engineering (IIC).
The authors of the study thus investigate the risks AI models face when trained only on synthetic, that is, artificially generated, material. In one of their tests, a model started from a text on medieval architecture and, by the ninth generation, the machine's output had degenerated into "a list of rabbits".
According to Victor Etxebarria, professor of systems engineering and automation at the University of the Basque Country (UPV/EHU), training these models on synthetic data means the AI "does not perform any really credible tasks", which turns these systems into tools "not only useless in helping us solve our problems, but also potentially harmful."
Theoretical Studies
For his part, Andreas Kaltenbrunner, lead researcher of the AI and Data for Society group at the Universitat Oberta de Catalunya (UOC), regrets that, despite its "good quality", the value of the study remains "at the theoretical level", because its conclusions rest on the assumption that future AI models will be trained only with synthetic data. "It is not clear what the result will be if data generated by humans are mixed with data generated by AI, and it is even less clear what will happen if data generated in a hybrid way between AI and humans, which is increasingly frequent, are also added," he says.
Beyond that hypothetical scenario, the study uses mathematical models to demonstrate that an AI can end up training on only part of the original data set, ignoring the rest, and that this narrowing leads to the model's collapse. Shumailov maintains that training AI models with artificial data is not impossible, provided that data is filtered first.
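The following sketch gestures at that kind of mitigation, under stated assumptions: the "filter" (a plausibility check against held-out human data) and the fixed share of human data mixed into each generation are purely illustrative choices, not the study's method.

```python
# Hypothetical sketch of filtering synthetic data before reuse and
# anchoring each generation with human data. Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
human = rng.normal(0.0, 1.0, size=1000)  # fixed pool of human data

data = human.copy()
for generation in range(1, 10):
    mu, sigma = data.mean(), data.std()
    synthetic = rng.normal(mu, sigma, size=1000)
    # Filter: keep only synthetic points plausible under the human data
    lo, hi = np.quantile(human, [0.001, 0.999])
    kept = synthetic[(synthetic >= lo) & (synthetic <= hi)]
    # Mix the filtered synthetic data with a fixed share of human data
    data = np.concatenate([kept, rng.choice(human, size=500)])
    print(f"generation {generation}: mu={mu:+.3f}, sigma={sigma:.3f}")

# With the human anchor, mu and sigma stay near 0 and 1 instead of
# drifting away, which is the qualitative point of filtering and mixing.
```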