Early AI artwork by Obvious Art, “La Baronne de Belamy” (2018). Courtesy of Sotheby’s

What is AI image scraping, and how can artists fight back?

Apps like DALL-E, Midjourney and Stable Diffusion are conjuring new images, but they’re trained on art by millions of real people

AI-generated art has been around for some time, but in the last year it’s well and truly taken over the internet. Despite fears that artificial intelligence will eclipse humans in other forms of “creativity” (see: ChatGPT’s uninspired prose and “grotesque” songwriting in the style of Nick Cave), visual culture has by far borne the brunt of the robot uprising, thanks to the widespread popularity and accessibility of text-to-image generators such as DALL-E 2, or apps like Lensa, which can transform your selfies into AI dreamscapes at the click of a button.

Even virtual artists have to start somewhere, though. Before they can churn out their own uncanny artworks, AI-powered models like DALL-E, Midjourney, Lensa, and Stable Diffusion have to be “trained” on billions of images, just like a human artist taking inspiration from art history. Where do these images come from? They’re taken – or “scraped” – from the internet, of course. 

In other words, AI art tools rely on human-made images for training data, collected by the millions from various online sources. Unsurprisingly, people aren’t always happy that their data is being harvested, and now they’re starting to push back.

In the last week, Meta filed a complaint against the surveillance startup Voyager Labs for scraping its user data, and Getty Images similarly announced that it’s suing Stable Diffusion creators Stability AI for illegally scraping its content. Then, there are the artists taking the fight into their own hands, with a class action lawsuit filed against Stability AI, Midjourney, and DeviantArt for using their work to train the companies’ respective image generators.

But why is scraping considered such bad news by many artists, and why are multi-billion dollar companies like Meta getting involved? First, let’s cover some basics...


Basically, scraping the internet involves creating software that automatically collects data from various sources, including social media, stock image sites, and (maybe most controversially) sites where human artists showcase their work, such as DeviantArt. In the case of AI image generators, this software is generally looking for image-text pairs, which are compiled into vast datasets.
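To make the idea concrete, here is a minimal, hypothetical sketch of the kind of logic a scraper might use — parsing a page’s HTML and pairing each image URL with its alt text. This is an illustration of the general technique, not any company’s actual pipeline, and real scrapers operate at vastly larger scale.

```python
# Toy sketch of collecting image-text pairs from a web page.
# Real scraping pipelines are far more elaborate; this only shows the idea.
from html.parser import HTMLParser

class ImagePairCollector(HTMLParser):
    """Collects (image URL, alt text) pairs from <img> tags."""
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            src, alt = attrs.get("src"), attrs.get("alt")
            if src and alt:  # only keep images with a usable caption
                self.pairs.append((src, alt))

# Hypothetical page markup for illustration
page = ('<img src="https://example.com/cat.jpg" alt="a painting of a cat">'
        '<img src="https://example.com/logo.png">')  # no alt text: skipped

collector = ImagePairCollector()
collector.feed(page)
print(collector.pairs)
# [('https://example.com/cat.jpg', 'a painting of a cat')]
```

Run across millions of pages, pairs like these are exactly the image-text data that ends up in the training sets described below.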

Some companies are completely transparent about the datasets they use. Stable Diffusion, for example, uses a dataset put together by the German charity LAION. “LAION datasets are simply indexes to the internet, i.e. lists of URLs to the original images together with the ALT texts found linked to those images,” the company explains on its website.

Other owners of image generators, such as OpenAI (DALL-E) or Midjourney, haven’t yet made their datasets public, so we don’t know exactly which images the AI is trained on. Given the quality of the output, though, they’re thought to be pretty extensive.


The billions of text-image pairs stored in these massive datasets essentially form a knowledge base to teach image generators how to “create” images for themselves. This teaching process involves getting the AI to make connections between the composition and visual data of an image, and the accompanying text.
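One common way models learn this image-text connection is by encoding images and captions as vectors and pushing matching pairs closer together than mismatched ones. The toy example below illustrates that comparison with made-up four-dimensional vectors (real models use embeddings with hundreds of dimensions, learned from data — nothing here is a real model’s output).

```python
# Toy illustration of image-text matching via cosine similarity.
# The vectors are invented for this example; real embeddings are learned.
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

image_vec = [0.9, 0.1, 0.0, 0.4]        # hypothetical encoding of a cat painting
matching_text = [0.8, 0.2, 0.1, 0.5]    # "a painting of a cat"
mismatched_text = [0.0, 0.9, 0.8, 0.1]  # "a photo of a skyscraper"

print(cosine_similarity(image_vec, matching_text))    # high (near 1)
print(cosine_similarity(image_vec, mismatched_text))  # much lower
```

During training, the model adjusts its encoders so that true image-caption pairs from the scraped dataset score high and random pairings score low — which is why the captions attached to scraped images matter as much as the images themselves.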

In a process called “diffusion”, the AI is then shown increasingly blurry or “noisy” images, and taught to reconstruct the original image out of the visual noise. Eventually, using this method, it will be able to construct images that have never existed before. It can only do this, however, if it’s gone through the process of copying billions of images that are already floating around online.
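The forward half of that process — progressively mixing noise into an image — can be sketched in a few lines. This is a heavily simplified toy (real systems add noise to tensors under a carefully tuned schedule, not to a four-pixel list), but it shows what “increasingly noisy” means in practice.

```python
# Simplified sketch of the forward diffusion step: blend each pixel with
# Gaussian noise, increasing the noise level until nothing of the
# original image remains. Training teaches the model to reverse this.
import random

def add_noise(pixels, noise_level, rng):
    """Blend pixels with Gaussian noise; noise_level ranges from 0 to 1."""
    return [(1 - noise_level) * p + noise_level * rng.gauss(0, 1)
            for p in pixels]

rng = random.Random(0)            # fixed seed so the sketch is reproducible
image = [0.2, 0.5, 0.8, 0.1]      # a tiny "image" of four pixel values

for step, level in enumerate([0.25, 0.5, 0.75, 1.0], start=1):
    noisy = add_noise(image, level, rng)
    print(f"step {step}: {[round(p, 2) for p in noisy]}")
```

At noise level 1.0 the output is pure noise, independent of the input — the model’s job at generation time is to run the reverse direction, turning noise into a plausible image, guided by a text prompt.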


Because artists’ original work – shared on social media, dedicated art-hosting websites, or anywhere else online – often ends up in the vast datasets used to train AI such as text-to-image generators, many artists fear that their work is being ripped off. These fears aren’t unfounded.

On the Stable Diffusion website, it openly states that artists aren’t given a choice about whether their work is scraped. “There was no opt-in or opt-out for the LAION 5b model data,” it says, referring to the data it’s trained on. “It is intended to be a general representation of the language-image connection of the Internet.”

For the most part, criticisms of this appropriation revolve around the theft of artists’ labour, and the fact that AI image generators could gradually replace them in professional roles. After all, why would a company commission an artist when it can type their name and get AI to churn out similar artworks for free? On the other hand, some artists suggest that the ability to scrape the entire contents of the internet will lead to more creative freedom, or even help develop new forms of creative expression.


In some cases, companies – or even entire countries – are trying to crack down on indiscriminate scraping through legal action and legislation, though the exact rules around this relatively new practice remain hazy.

On January 17, for example, Getty Images started legal proceedings against Stability AI, claiming that its machine learning model “unlawfully copied and processed millions of images” protected by copyright. In a statement, Getty Images goes on to say that it believes “AI has the potential to stimulate creative endeavours”, but that Stability AI didn’t seek a licence to scrape the Getty collection for its own commercial use.

Last week, meanwhile, Meta filed a complaint against the surveillance startup Voyager Labs, claiming that it improperly harvested data from its social media sites Facebook and Instagram, as well as others including Twitter, YouTube, and Telegram. To scrape data, Voyager Labs apparently created over 38,000 fake profiles, extracting public information from more than 600,000 other users without their consent. Meta is asking the company to stop, as well as to give up its profits and pay damages.


At the same time that high-profile cases from the likes of Meta and Getty Images are taking place, there’s the coalition of artists taking legal action against some of the biggest art generator giants. In a complaint filed in the United States District Court for the Northern District of California on January 13, artists Karla Ortiz, Kelly McKernan, and Sarah Andersen claim that Stability AI, Midjourney, and DeviantArt have violated copyright laws by using their images – plus art by tens of thousands of other artists – to feed their image generators.

“Though the rapid success of Stable Diffusion has been partly reliant on a great leap forward in computer science,” the complaint reads, “it has been even more reliant on a great leap forward in appropriating images.”

Besides legal action and campaigning for legislation to tighten scraping laws, however, there isn’t a whole lot that artists can do to protect their work at the moment, other than taking it completely offline. For many artists, of course, that simply isn’t an option.