OpenAI "Shocked" That an AI Company Might Have Trained on Data Without Permission - Oh, the Irony

Silicon Valley has a new contender for the “Lack of Self-Awareness” award. OpenAI is reportedly outraged that Chinese AI startup DeepSeek may have trained its model using OpenAI’s outputs. That’s right: the company that built its empire on hoovering up internet data without permission is now clutching its pearls over someone else doing the same thing.
When the Data Scraper Gets Scraped
According to reports from The Verge and 404 Media, OpenAI is investigating whether DeepSeek trained its AI model using ChatGPT’s outputs (a process known as distillation) and whether this violates OpenAI’s terms of service. The alleged crime? Using data without authorization or compensation.
To put that in perspective, OpenAI’s own language models were trained on vast swaths of internet content, much of which was scraped without explicit permission from publishers, authors, or websites. Multiple lawsuits have been filed against OpenAI for using copyrighted books, news articles, and other content without compensation. But now that OpenAI thinks someone might be doing the same to them, suddenly it’s a problem.
“That’s Different Because… Reasons”
OpenAI’s defenders will argue that training on raw internet data is different from training on ChatGPT’s responses because the latter represents some sort of refined intellectual property. But if DeepSeek’s AI is tainted by OpenAI’s outputs, then OpenAI’s own models are arguably tainted by the books, articles, and human-generated content they absorbed without permission.
And let’s not forget: OpenAI has never publicly disclosed the full extent of the data sources used to train its models. We know it has consumed massive amounts of text from news sites, Wikipedia, books, and social media. OpenAI has also cut deals with some publishers - after the fact - to make its scraping seem more legitimate. But for years, the company operated under the “ask for forgiveness, not permission” model of data collection.
Now, when someone else appears to be following that same playbook, OpenAI is furious.
The AI Industry’s Wild West Just Got Wilder
If OpenAI takes action against DeepSeek, it could open up an even messier debate about AI ethics, data ownership, and who actually has the right to train these models. Should OpenAI be allowed to monopolize access to AI-generated content? Should AI companies be required to compensate data sources retroactively? And if training on OpenAI’s outputs is unethical, does that mean OpenAI’s entire foundation is also in question?
One thing’s for sure: OpenAI being shocked that someone might use its data without asking is a masterclass in irony. 🤦‍♂️
AUTHOR: dpi