Researchers at the Stanford Internet Observatory have detected child sexual abuse material in one of the large datasets used to train image-generating artificial intelligence models, including some as popular as Stable Diffusion. Models trained on this dataset, known as LAION-5B, are being used to create photorealistic nude images with AI, including images of child sexual exploitation.

LAION-5B is a database of 5.85 billion text-image links scraped from the Internet and used to train generative AI models dedicated to image creation. Its purpose, as the organization explains on its website, is to “democratize research and experimentation in the training of large-scale multimodal models”, although it admits that the database is unsupervised and that the “uncurated nature of the dataset” can result in “very uncomfortable and disturbing content.”

Among all that training data, members of the Stanford Internet Observatory (SIO) identified 3,226 suspected images of child sexual abuse, of which 1,008 were externally validated.

An earlier report by SIO and Thorn, a nonprofit group that monitors child online safety, found that rapid advances in machine learning were making it possible to create realistic images that facilitate child sexual exploitation using open-source AI image models. What this new research reveals is that these models are trained directly on images of child sexual abuse present in open-access datasets such as LAION-5B.

The dataset examined included child exploitation content extracted from a wide variety of sources, including mainstream social media websites and popular adult video platforms, as explained in a note released by the Observatory.

Those working in the field of artificial intelligence ethics have long warned that the massive scale of these tools’ training datasets makes it materially impossible to filter them or to audit the AI models that use them.

However, technology companies, eager to position themselves in the growing generative AI market, have largely ignored these concerns and continue to build their products on AI models trained on these massive datasets.

“We are facing a situation that is not new and that will be repeated due to the lack of transparency, both at the data-governance level and at the technical level, about the type of data used to train AI systems, especially LLMs or large language models,” explains Albert Sabater Coll, director of the Observatory of Ethics in Artificial Intelligence of Catalonia, at the University of Girona, in relation to the presence of child sexual abuse material in the LAION database.

And he emphasizes that, although transparency by itself is not enough to guarantee that the data used to train AI is correct and appropriate, it “not only fosters a culture of responsibility and ethical development of artificial intelligence but is also key to building effective, fair and reliable systems.”

Because, as Sabater points out, AI systems are only as good as the data they are trained on, and if the dataset is biased, the AI will likely perpetuate or even amplify those biases.

For this reason, he considers that governments and international organizations must enact regulations “in which transparency over datasets is a kind of Kantian imperative, and there must always be information about where the data comes from, how it was collected and whether it respects the privacy of the people whose data could be included.”

According to the director of the AI Ethics Observatory of Catalonia, “only in this way will it be possible to understand and identify why an AI system could be making errors or behaving in an anomalous or unacceptable way due to the training data.”

Regarding the research conducted by the Stanford Internet Observatory, the identified material is currently being removed: the researchers reported the URLs of the images to the National Center for Missing and Exploited Children (NCMEC) in the United States and to the Canadian Centre for Child Protection (C3P).

The study was conducted with hashing tools that compare an image’s digital fingerprint against databases maintained by nonprofit organizations that receive and process reports of online child sexual exploitation and abuse. The researchers suggest these practices would be a good method for cleaning and minimizing the presence of this type of content in the datasets used to train AI models.
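The article does not name the specific tools involved, but the general fingerprint-matching idea can be illustrated with perceptual hashing: each image is reduced to a compact hash, and hashes within a small Hamming distance of a known-bad entry are flagged. The following is a minimal, purely illustrative sketch using the open-source Python imagehash library and a hypothetical placeholder blocklist; real detection pipelines rely on purpose-built systems and on hash lists maintained by child-protection organizations, which are not public.

```python
"""Illustrative sketch of perceptual-hash matching against a blocklist.

Hypothetical example only: the blocklist values below are placeholders, and real
pipelines use purpose-built matching systems and non-public hash lists.
Requires: pip install pillow imagehash
"""
from PIL import Image
import imagehash

# Hypothetical blocklist: hex-encoded 64-bit perceptual hashes (placeholder values).
KNOWN_HASHES_HEX = ["c3a1b2d4e5f60718"]
KNOWN_HASHES = [imagehash.hex_to_hash(h) for h in KNOWN_HASHES_HEX]

# Hamming-distance threshold for treating two images as near-duplicates.
MAX_DISTANCE = 4


def is_flagged(image_path: str) -> bool:
    """Return True if the image's perceptual hash is close to any blocklisted hash."""
    candidate = imagehash.phash(Image.open(image_path))  # compute perceptual hash
    return any(candidate - known <= MAX_DISTANCE for known in KNOWN_HASHES)
```

Because the comparison works on fingerprints rather than the images themselves, a dataset curator can screen candidate URLs against a hash list without ever hosting or redistributing the underlying material, which is why the researchers point to this kind of matching as a practical cleaning step.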

LAION, for its part, has temporarily withdrawn its datasets to ensure “they are safe before republishing them,” as the organization told the technology outlet 404 Media.