Where is Apple Intelligence Getting Its Data? Inquiring Artists Want to Know

By Jesse Hollington

Published: Jul 3rd, 2024

When Apple showed off its new Apple Intelligence features at last month’s Worldwide Developers Conference (WWDC), it promised a whole new era of on-device AI processing that would let iPhone, iPad, and Mac users summarize text, have more natural conversations with Siri, and generate new images and emoji.

However, this last point has left some members of the creative community concerned. Apple has been quick to talk about the many cool things that Image Playground and Genmoji can do, but it’s been far less forthcoming about where its generative AI models are getting the smarts they need to pull off these feats.

The folks at Engadget spoke with several artists and creators who find Apple’s lack of transparency regarding its AI models distressing, especially for a company that’s traditionally had such a great relationship with creatives.

Creatives have historically been some of the most loyal customers of Apple, a company whose founder famously positioned it at the “intersection of technology and liberal arts.” But photographers, concept artists and sculptors who spoke to Engadget said that they were frustrated about Apple’s relative silence around how it gathers data for its AI models. (Pranav Dixit, Engadget)

While the ability of generative AI models to create entirely new images out of thin air may seem magical, they aren’t making this stuff up out of whole cloth. Just like human intelligence, artificial intelligence models have to be trained to do what they do, and that training typically involves feeding them vast amounts of data.

For many companies, that’s simply a matter of hoovering up whatever can be found on the public internet. This has often been done with no regard for intellectual property rights, and as Engadget notes, “consent or compensation be damned.”

After all, there are billions of images freely available on the internet, but just because humans can find them and look at them doesn’t mean it’s appropriate to use them to train AI models — any more than it would be to grab a photographer’s image from their website and use it somewhere else without their express permission.

While this may have been something of an ethical gray area in the days when open-source AI models were being trained purely for non-profit scientific research, companies are now profiting from these well-trained models, which means that they’re indirectly profiting from other people’s hard work.

The legal aspects of this are still being worked out, but with dozens of lawsuits making their way through the courts, both precedent and policy will soon be set. These also aren’t just small creatives suing big tech companies; the big record labels are coming down hard on AI startups that are using their artists’ songs to create “new” AI-generated musical works.

Last month, a report in the Wall Street Journal (Apple News+) detailed a lawsuit by Universal, Sony, and Warner against AI startups Suno and Udio, alleging that the two companies used copyrighted works scraped from the internet to train their AI models to create “sound-a-likes of recordings” from famous artists, with vocals that the Recording Industry Association of America (RIAA) says “are indistinguishable from famous artists, including Lin-Manuel Miranda, Bruce Springsteen, Michael Jackson and ABBA.”

This is also one of the inherent problems with generative AI at this point. It’s much better at copying than it is at creating.

Enter Apple Intelligence…

That may be partly why Apple has shied away from creating realistic images in its new Image Playground and Genmoji features. From what we’ve seen, both features produce stylized, almost Pixar-like artwork. That style might reduce the chances of Apple Intelligence blatantly ripping off the work of others, but it still doesn’t let Apple entirely off the hook.

The biggest concern among creatives is that Apple has failed to have the ethics conversation at all. Many felt that Apple, of all companies, would do better and still believe it should step up to the plate and be open about the sources of data used to train its AI models.

1/ Apple “Intelligence” is here and 0 questions of “where does the data come from?” to be seen in press.

APPLE is trying to shove a huge privacy risk and tech that screams scraped off the internet without consent to the public. So here’s a list of potential data sources pic.twitter.com/2WBzRSjsh3 (Karla Ortiz, @kortizart, June 10, 2024)

The same day it showed off Apple Intelligence at WWDC, Apple posted an article on its Machine Learning Research blog explaining that it trains on licensed data that it’s paid for rights to train with, along with publicly available data from the web.

We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot. Web publishers have the option to opt out of the use of their web content for Apple Intelligence training with a data usage control. (Apple)

However, it hasn’t said much beyond that. While using licensed data is laudable, the problem is that “publicly available” doesn’t mean royalty-free, and it’s a distinction that’s difficult to make when scraping billions of pieces of information from the public internet.

Apple’s senior vice president of AI and machine learning, John Giannandrea, has said that Apple has created a lot of its own training data but hasn’t said what type of data that is. Apple has reportedly signed a licensing deal with Shutterstock and possibly also with Photobucket, but it hasn’t confirmed those deals. Presumably, they’re part of the “licensed data” the company is referring to in its blog post.

Then there’s the concern that while Apple rightfully allows web publishers to opt out, that’s like closing the barn door after the horse has left. The opt-out itself is based on trust, and there’s no straightforward process for removing data that may have already been gathered before a site chose to exclude AppleBot. It’s not even clear if it’s possible to make an AI model “forget” or “unlearn” a specific piece of information.

8/ The bottomline is, we know Generative Ai to function as is, relies on massive overreach and violations of rights, private and intellectual. This is true for all GenAi companies, and as Apple pushes this tech down our throats it’s important to remember they are not an exception (Karla Ortiz, @kortizart, June 10, 2024)

Ultimately, what has many artists and other creative professionals upset isn’t necessarily that Apple is scraping the open web for its data, but rather the fear that it could be just as bad as the rest of the big tech companies — and its lack of transparency about where its data comes from only magnifies those fears.

Apple needs to do better. If it’s truly approaching generative AI training using only ethically sourced materials, it should shout that from the rooftops every bit as loudly as it does its environmental initiatives and its focus on user privacy. The fact that it isn’t doing so is a problem.