Adobe’s SlimLM case could change how every AI model is trained.

For much of the past decade, Adobe has positioned itself as a conscientious steward of the creative economy. Its software powers photographers, filmmakers, designers, and authors, many of whom depend on intellectual property protections to earn a living. That reputation is now facing scrutiny after a proposed class-action lawsuit accused the company of training one of its artificial-intelligence models on pirated books, including copyrighted works by living authors.
The lawsuit, filed on behalf of Oregon-based author Elizabeth Lyon, alleges that Adobe relied on unauthorized copies of books to train SlimLM, a small language model designed for document assistance on mobile devices. While the case may ultimately hinge on technical distinctions about datasets and derivatives, it underscores a far larger problem: the AI industry’s unresolved dependence on copyrighted material, and its growing collision with the legal and moral foundations of creative work.
The Dataset Beneath the Model
At the center of the complaint is SlimPajama-627B, an open-source dataset released in 2023 and used to pre-train Adobe’s SlimLM. Adobe has described SlimPajama as “deduplicated” and “multi-corpora,” language that suggests an attempt to distance the dataset from the more controversial sources that have plagued generative AI training pipelines.
But the lawsuit argues that SlimPajama is not as clean as advertised.
According to the filing, SlimPajama was derived from RedPajama, which itself incorporated Books3, a massive repository of approximately 191,000 books that has become shorthand for everything legally murky about AI training data. Books3 is widely believed to contain pirated versions of copyrighted works, assembled without author consent, attribution, or compensation.
“The SlimPajama dataset was created by copying and manipulating the RedPajama dataset,” the lawsuit states, “including copying Books3.” Because of that lineage, the complaint argues, SlimPajama necessarily contains copyrighted material belonging to Lyon and other authors.
If true, the implication is difficult to ignore: even companies that claim to rely on “open” or “cleaned” datasets may still be building on foundations that were never legally sound.
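That lineage is, to a degree, checkable. Because SlimPajama is publicly distributed, its provenance can be sampled directly. The sketch below is an illustration rather than an exhibit from the complaint: it assumes the public Hugging Face release at cerebras/SlimPajama-627B and the per-document meta["redpajama_set_name"] label described in that release's documentation, and it simply tallies which RedPajama subsets a sample of records report as their source.

```python
# Illustrative sketch: sample SlimPajama's provenance labels.
# Assumptions (check the dataset card before relying on this):
#   - the public release "cerebras/SlimPajama-627B" on Hugging Face
#   - each record carries meta["redpajama_set_name"], e.g. "RedPajamaBook"
from collections import Counter

from datasets import load_dataset  # pip install datasets

# Stream the corpus so the full 627B-token release is never downloaded.
stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

counts = Counter()
for i, record in enumerate(stream):
    meta = record.get("meta")
    label = meta.get("redpajama_set_name", "unknown") if isinstance(meta, dict) else "unknown"
    counts[label] += 1
    if i >= 9_999:  # a 10,000-document sample; the full corpus is far larger
        break

print(counts.most_common())
```

A tally like this only shows which upstream subsets the sampled documents are labeled as coming from; whether any particular author's book sits inside the books-derived portion is exactly the kind of factual question the litigation will have to resolve.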
A Familiar Pattern Across Big Tech
Adobe is far from alone.
Over the past two years, RedPajama and Books3 have surfaced repeatedly in litigation against major technology firms. Apple has faced accusations that its Apple Intelligence system relied on similar datasets. Salesforce has confronted comparable claims. Meta, OpenAI, and Anthropic have all been drawn into disputes over training data provenance.
In September, Anthropic reportedly agreed to pay $1.5 billion to resolve claims brought by authors who alleged their books were used, without permission, to train Claude, the company’s flagship chatbot. That settlement, while not an admission of wrongdoing, sent a clear signal: copyright disputes over AI training are no longer theoretical.
They are expensive.
The Legal Gray Zone That Built AI
Generative AI did not emerge from a vacuum. It emerged from text: vast quantities of it, scraped, copied, mirrored, and repackaged from the internet, libraries, and digital archives. For years, this process operated under a loosely held assumption: that training on copyrighted material might qualify as “fair use,” even if the outputs competed with or displaced the original creators.
Courts have yet to definitively settle that question.
What has changed is the scale. Models like SlimLM, though described as “small,” are still products of industrial-scale data ingestion. When that ingestion includes entire books, sometimes thousands of them, the argument that training is merely “transformative” begins to strain.
Authors are not arguing that AI cannot exist. They are arguing that it cannot be built by quietly emptying the shelves of the world’s libraries and calling the result innovation.
Adobe’s Brand Problem
For Adobe, the lawsuit cuts especially deep.
Unlike social-media platforms or consumer AI startups, Adobe’s business is inseparable from professional creators. Its Firefly models were marketed as “commercially safe,” trained on licensed or Adobe-owned content. That positioning differentiated Adobe from competitors whose datasets were opaque at best.
SlimLM complicates that narrative.
Even if Adobe ultimately prevails in court, the reputational damage may linger. Creative professionals are increasingly wary of companies that benefit from their labor while redefining the rules of ownership beneath them.
The irony is hard to miss: a company built on protecting intellectual property now finds itself accused of exploiting it.
Why This Case Matters More Than It Appears
At first glance, this may look like another skirmish in a crowded field of AI copyright lawsuits. In reality, it points to a deeper reckoning.
The AI industry has matured faster than its ethical or legal frameworks. Startups raced ahead using whatever data they could find. Larger companies followed, sometimes assuming that distance from the original dataset, or reliance on an intermediary, provided sufficient insulation.
The Adobe case challenges that assumption.
If derivative datasets inherit the legal liabilities of their sources, then much of today’s AI ecosystem rests on unstable ground. Cleaning, deduplicating, or reformatting data may not be enough if the original material was never lawfully obtained.
That realization could force a fundamental reset: toward licensed data, revenue-sharing models, or entirely new approaches to training that do not rely on wholesale appropriation.
The Road Ahead: Regulation or Reckoning
Lawmakers have so far struggled to keep pace with generative AI. But the pressure is mounting. As courts weigh these cases, companies face a choice: litigate endlessly, or redesign how AI is built.
For authors, the stakes are existential. For technology companies, they are strategic. For society, they are cultural.
AI will shape how knowledge is created, distributed, and valued. Whether it does so by collaborating with creators, or by consuming them, remains an open question.
Adobe’s lawsuit does not answer that question. But it makes something clear: the era of quietly training models on borrowed words is coming to an end.