Analysis: The EU’s AI training data template – clarity for developers, uncertainty for creators

31 Jul 2025

Barry Scannell and Leo Moore

William Fry partners Barry Scannell and Leo Moore welcome the publication of the European Commission’s long-awaited mandatory AI training data template.

From August 2025, the European Commission will require providers of general-purpose AI (GPAI) models to publish a summary of the content used to train those models.

The requirements, introduced under Article 53 of the AI Act, aim to enhance transparency and facilitate the enforcement of rights under EU law, particularly in the context of copyright.

The long-awaited mandatory AI training data template was published by the European Commission on 24 July 2025 and sets out how these summaries must be structured and what information they should contain.

The initiative represents a significant regulatory intervention into how AI developers document the provenance of their training data. However, its practical value for rightsholders, particularly in the context of licensing copyrighted works, remains uncertain.

The template covers a wide range of data categories, including publicly available datasets, private datasets, scraped web content, user data and synthetic data. Providers are required to disclose the source and nature of the data along with a general description of the content involved.

However, the template does not require disclosure of specific details of data or works used to train the AI model. This shows that the purpose of the summaries is to provide transparency in aggregate rather than precision in detail.

Balancing transparency and trade secrecy

The Commission emphasised the need to balance the protection of trade secrets and confidential business information against the need for enough transparency to enable parties with legitimate interests to exercise their rights under EU law. The level of detail required in the training data summary varies depending on the source of the data.

Notably, private datasets obtained from third parties and not commercially licensed by rightsholders need be listed only if they are publicly known; otherwise, they may be described in general terms. The template therefore explicitly allows providers to withhold detailed information where it is commercially sensitive.

The Commission also recommends that providers act in good faith and, on a voluntary basis, supply additional details beyond the minimum standards or offer an “upon request” mechanism whereby rightsholders may ask whether their domains were included in scraping activities. Both measures remain optional, however. Providers are not legally obliged to grant such requests, nor to respond to them.

Copyright and licensing challenges

Article 4 of the Copyright in the Digital Single Market Directive (CDSM) allows rightsholders to opt out of having their copyright-protected content used for text and data mining (TDM).

However, rightsholders may be unhappy with the template because the AI training data summaries are not required to disclose specific information on the data used, so rightsholders have no way of knowing whether their content has been included.

Furthermore, while the template requires developers to disclose what procedures they followed to identify and honour rightsholders’ requests to opt out under Article 4 CDSM, the system contains no mechanism for independent verification of whether these opt-out requests were correctly identified or respected. It also does not compel developers to disclose how scraped material was filtered to exclude reserved works.

Another implication of the general nature of the summary is that it does not help individual rightsholders enter licensing arrangements for the use of their works in AI training. The template is therefore more conducive to licensing deals with large publishers and intermediaries. This aligns with expectations that AI providers will continue to seek bulk licences from major content aggregators or collective management organisations, rather than negotiate directly with creators.

Supervisory scope of the AI Office and cross-jurisdictional implications

The AI Office will have the power to verify whether the template has been filled in correctly. However, it will not perform a work-by-work assessment or check whether specific content has been used or not for the training of the GPAI model.

The AI Office has also made clear that it will not adjudicate individual copyright disputes. Any such claims will remain subject to national law and the courts, with the burden of proof falling on rightsholders to establish that their content was used and that their rights were infringed.

This raises important questions about the extent to which the training data summary can serve as a practical tool for rights enforcement.

It is also worth noting that, by mandating public disclosures that highlight the categories and sources of training data, the EU may inadvertently create legal exposure for providers in jurisdictions such as the United States, where fair use defences for AI training remain contested. This could, in time, discourage some developers from launching models in the EU, particularly if the reputational or litigation risks are perceived to outweigh the regulatory benefits.

Conclusion

The publication of the template deserves recognition as an early and serious attempt to confront the transparency gap in AI training. No other major jurisdiction has introduced equivalent obligations.

Whether this regulatory model proves to be effective or sustainable will depend largely on its implementation, the willingness of AI providers to adopt more detailed disclosures, and the evolution of judicial interpretation over time.

For now, rightsholders will gain some visibility into how AI systems are built. But for those seeking to identify and license specific works, the current template may fall short of expectations. It is a structural step forward, but one that still leaves significant gaps in practice.

The effectiveness of the transparency regime will likely turn not on the publication of summaries, but on whether those summaries can be meaningfully used to support lawful licensing and rights enforcement.

Barry Scannell and Leo Moore are partners at William Fry LLP.