OpenAI plans to release a tool called Media Manager in 2025 that will let authors and content owners control how their works are used to train generative AI models. The move responds to criticism OpenAI has faced for collecting publicly available internet data to train its models. In particular, several major American newspapers, including the Chicago Tribune, have sued the company for copyright infringement, alleging that their articles were used for commercial purposes without permission or compensation.
OpenAI has argued that it is impossible to create useful AI models without using copyrighted materials. Company representatives have also pointed out that scraping the web for data has been standard practice for decades, yet it has drawn criticism only now, following the commercial success of some products.
The company says it is developing a new standard in collaboration with authors, content owners, and regulators. Media Manager is expected to give rights holders new options for managing whether and how models are trained on their copyrighted content.
OpenAI previously provided authors with the option to “opt out” of their content being used in model training. The company also introduced the ability for website owners to control access to their content through robots.txt and entered into licensing agreements with major organizations, including media outlets, image libraries, and websites like Stack Overflow.
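To illustrate how the robots.txt mechanism works in practice, here is a minimal Python sketch that checks whether a given crawler is allowed to fetch a page. It assumes the site blocks the user agent GPTBot, the name OpenAI documents for its web crawler; the rules, the example.com URL, and the can_fetch helper are hypothetical and shown only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block OpenAI's crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain, not a real publisher.
url = "https://example.com/articles/1"
gptbot_allowed = can_fetch(ROBOTS_TXT, "GPTBot", url)
other_allowed = can_fetch(ROBOTS_TXT, "SomeOtherBot", url)
print(gptbot_allowed, other_allowed)
```

Note that robots.txt is a voluntary convention: it only has an effect if the crawler chooses to honor it.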
However, some experts believe that OpenAI is not doing enough. The current opt-out process requires owners to submit each image individually along with a description, which makes it nearly impossible to exclude content from the training data set in bulk.
Other companies are also working on universal tools to protect data from being used in AI training. For example, the startup Spawning AI offers a database for registering copyrighted works, while projects like Steg.AI and Imatag are developing imperceptible watermarks to protect images.