High-Quality Data Network: AI Thesauri Pollution Threatens Future Models
Amsterdam, Tuesday, 24 June 2025.
Generative AI is increasingly polluting online data with hallucinations, synthetic information, and fake news. This fabricated material in turn feeds new AI models, degrading data quality further. Researchers and technologists, including the CTO of Cloudflare, are urging that unpolluted data from before 2022 be preserved to secure the future of reliable AI models.
The Threat of Polluted Data
Generative AI plays a crucial role in the creation of new information, but the technology also brings significant risks. Hallucinations, synthetic data, and fake news are being generated ever more frequently, leading to a rapid decline in the quality of online data. This fabricated information in turn feeds new AI models, creating a vicious circle. Researchers and technologists, including John Graham-Cumming, the CTO of Cloudflare, emphasise the importance of preserving unpolluted data from before 2022 to secure the future of reliable AI models [1][2].
Historical Comparison: Polluted Metal
Graham-Cumming draws a parallel with the explosion of the first atomic bomb during the Trinity test in New Mexico in 1945. Just as radioactive particles from atomic explosions contaminated the air and left newly produced metal with elevated background radiation, fabricated information degrades the data that AI models rely on. Cleaning this data is extremely costly, and mandatory labelling of AI-generated data is practically impossible [1].
The Role of Scientists
Scientists are also concerned about data pollution. In December 2024, a group of researchers wrote of their fear that many AI models will ultimately succumb to this problem. They advocate preserving data from before 2022, before the generative AI explosion, because that data contains minimal pollution [1].
Power and Access
Access to clean data may become a source of power in the future. According to Maurice Chiodo, a researcher at the Centre for the Study of Existential Risk in Cambridge, cleaning data is difficult and expensive. Only large established organisations and governments may have the resources needed to collect large amounts of clean data from before 2022 [1].
Practical Tips for Readers
Recognising fake news requires a critical attitude and a habit of consulting multiple sources. Here are some practical tips for readers:
- Check the Sources: Ensure that the information comes from reliable and verifiable sources.
- Look at the Date: Check when the article was published and whether it has been updated recently.
- Seek Multiple Sources: Compare information from different sources to get a balanced view.
- Check the Writing Style: Look out for language errors or an overly emotional tone, which are often characteristic of fake news.
- Fact-Check: Use fact-checking websites to verify the accuracy of information [1][2].
Implications for Media Literacy and Democracy
The spread of fake news has both direct and indirect implications for media literacy and democracy. It erodes public trust in the media and in government institutions, which leads to polarisation and division. It is therefore crucial that both individuals and organisations actively work to recognise and combat fake news. Education and training in media literacy can play a vital role [1][2][3].