Adobe Systems Incorporated is on the defensive after being hit with a proposed class-action lawsuit. The complaint alleges that the company exploited authors’ works in the pre-training of its AI model, SlimLM. The lawsuit claims that Adobe unlawfully used pirated versions of numerous books, including works by Oregon author Elizabeth Lyon, to train this small language model designed for on-device document assistance.
The SlimPajama-627B dataset, released by Cerebras in June 2023, is at the heart of the controversy. This dataset served as the basis for SlimLM’s development. According to the lawsuit, Elizabeth Lyon’s writing was included in a processed subset of a manipulated dataset that underpins Adobe’s SlimLM program. The complaint alleges that most or all of that dataset was developed by copying and modifying the RedPajama dataset, which has already faced criticism for including infringing materials.
The lawsuit underscores broader concerns in the tech community over the use of copyrighted content in large language model training. Similar lawsuits have been filed against tech giants like Google and Facebook. Salesforce faced a major PR scandal over its use of the RedPajama dataset. Anthropic recently reached a $1.5 billion settlement with writers who sued the company for allegedly using pirated copies of their works to train its chatbot, Claude. Additionally, a lawsuit against Apple asserted that the company, too, had used copyrighted material as training data for its own model, Apple Intelligence.
Adobe has been one of the more active players in the AI space. Since 2023, it has launched a variety of new services, including Firefly, an AI-powered media-generation suite. As the company continues to assert its leadership in generative artificial intelligence, ethical questions loom over how its training content is sourced.
The lawsuit asserts that Adobe acted “without consent and without credit or compensation.” The claim has only intensified debates over intellectual property rights in the era of generative AI.
“The SlimPajama dataset was created by copying and manipulating the RedPajama dataset (including copying Books3),” – Reuters and other articles.
As the case makes its way through the courts, its outcome could have far-reaching implications beyond Adobe itself. From legality to ethics to copyright infringement, the conversation around AI training datasets is changing quickly. How that conversation unfolds matters to corporations, authors, and content creators who seek appropriate recognition and compensation for their work.

