
OpenAI Faces New Allegations Over AI Training on Copyrighted Books

Published: April 2, 2025

Reading Time: 2 minutes

OpenAI is once again under scrutiny for allegedly training its AI models on copyrighted content without proper authorization. A new report from the AI Disclosures Project suggests that the company relied on non-public books to develop its latest and most advanced AI model, GPT-4o.

How AI Models Learn and Why Copyright Matters

AI models like GPT-4o are built on massive datasets, learning from books, articles, movies, and more to generate human-like text. When an AI writes an essay or creates an image in a specific style, it’s not truly “thinking” – it’s predicting patterns based on what it has already learned.
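To make "predicting patterns" concrete, here is a deliberately tiny sketch: a bigram model that predicts the next word from counts seen in its training text. This is purely illustrative — real models like GPT-4o use large neural networks, not count tables — but the core idea of continuing text based on learned statistics is the same.

```python
from collections import Counter, defaultdict

# Toy "language model": count which word follows which in the training text,
# then predict the most frequent continuation. Illustrative only.
training_text = "the cat sat on the mat the cat ran".split()

next_word_counts = defaultdict(Counter)
for current, following in zip(training_text, training_text[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Return the continuation seen most often after `word` in training."""
    counts = next_word_counts[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" (seen twice, vs. "mat" once)
```

The model has no understanding of cats or mats; it simply reproduces the statistics of whatever text it was trained on — which is exactly why the provenance of that text matters.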

However, the key question is: where is this data coming from?

If AI companies are using copyrighted books without permission, it raises serious legal and ethical concerns.

New Report Claims OpenAI Used Paywalled O’Reilly Books

According to the AI Disclosures Project, OpenAI’s GPT-4o appears to recognize content from O’Reilly Media books that were never made publicly available.

The nonprofit, co-founded by Tim O’Reilly (CEO of O’Reilly Media) and economist Ilan Strauss, conducted an in-depth analysis using a method called DE-COP.

This method, known as a “membership inference attack,” tests whether an AI model has prior knowledge of specific texts by comparing human-authored excerpts with AI-generated versions. The study found that:

  • GPT-4o showed strong recognition of O’Reilly Media books published before its training cutoff date.
  • OpenAI’s older model, GPT-3.5 Turbo, recognized fewer paywalled books, suggesting that GPT-4o was trained on a different dataset.
  • Even after adjusting for improvements in AI model performance, the results pointed to potential use of copyrighted material.
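The logic of a DE-COP-style membership inference test can be sketched in a few lines. The shape below is an assumption drawn from the description above, not the report's actual code: show the model a verbatim excerpt mixed among paraphrases, and if it identifies the verbatim option well above chance for non-public books, that suggests the text was in its training data. The `model_pick` function is a hypothetical stand-in for a real LLM call.

```python
import random

def model_pick(options, verbatim_index, memorized):
    """Hypothetical stand-in for querying an LLM: a memorized text is
    recognized reliably; an unseen text yields a random guess."""
    if memorized:
        return verbatim_index
    return random.randrange(len(options))

def inference_score(trials, n_options, memorized, seed=0):
    """Fraction of trials where the model picks the verbatim excerpt.
    Scores far above 1/n_options hint at training-data membership."""
    random.seed(seed)
    hits = 0
    for _ in range(trials):
        verbatim_index = random.randrange(n_options)
        options = [f"excerpt-variant-{i}" for i in range(n_options)]
        if model_pick(options, verbatim_index, memorized) == verbatim_index:
            hits += 1
    return hits / trials

chance_rate = 1 / 4
print(inference_score(1000, 4, memorized=True))   # well above 0.25
print(inference_score(1000, 4, memorized=False))  # close to 0.25
```

In the study's terms, GPT-4o scoring far above chance on paywalled O'Reilly excerpts is the signal; GPT-3.5 Turbo scoring closer to chance suggests those books were less present in its training data.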

Is This Proof of Copyright Infringement?

While the findings are significant, the report’s authors acknowledge that their method isn’t foolproof. There’s a possibility that OpenAI didn’t directly train on these books but instead learned from excerpts copied and pasted into ChatGPT by users.

Additionally, the report didn’t evaluate OpenAI’s latest models, such as GPT-4.5 or reasoning models like o3-mini and o1. This means there’s no clear evidence that OpenAI is still relying on paywalled books for training.

The Bigger Picture: AI Companies Are Racing for High-Quality Data

OpenAI isn’t alone in facing scrutiny over its data sources. As AI models become more advanced, companies are looking for better-quality training data. This has led some AI firms to:

  • Hire journalists and subject-matter experts to refine AI responses.
  • Pay for access to news publishers, social media platforms, and stock media libraries.
  • Implement opt-out mechanisms for copyright holders (though these systems aren’t always effective).

Despite these efforts, OpenAI has been a vocal advocate for more flexible regulations around using copyrighted content to train AI. This stance has already landed the company in multiple legal battles.

What’s Next for OpenAI?

As OpenAI faces lawsuits and increasing scrutiny over its data practices, the company has yet to respond to the allegations in the new report. Whether these claims lead to legal consequences remains to be seen, but one thing is certain – AI training practices will continue to be a hot topic in the tech industry.

Would you trust AI models trained on copyrighted content? Let us know your thoughts!

Onome

Contributor & AI Expert