The arduous application of the exceptions provided for in European Union copyright law in the context of training machine learning models

INTELLECTUAL PROPERTY RIGHTS

11/8/2025 · 29 min read

The previous post, "An introduction to copyright infringement in machine learning training and in the deployment of generative artificial intelligence", explained the debate surrounding the use of copyrighted works and materials protected by related rights for training machine learning (ML) models. Although reproductions occur during the data collection, pre-processing, processing, and deployment phases of ML models, two exceptions in EU law could legitimise some of these acts: Art. 5(1) InfoSoc Directive[1], which covers temporary copies, and Arts. 3 and 4 DSM Directive, which regulate text and data mining (TDM).[2] Before beginning the analysis, it should be noted that only acts of reproduction fall within the scope of these exceptions. This means that authorisation from the rightholders would still be required for any adaptation or communication to the public. In my opinion, the copies covered by these exceptions are those created during the “input phase”, i.e. up to the point at which the model is obtained. However, they do not include those produced by the model during deployment.

Temporary copies exception

According to Art. 5(1) InfoSoc Directive, five cumulative requirements must be met:

  • The copies must be temporary.

  • The copies must be either transient or incidental. Copies are transient if “their duration is limited to what is necessary for that process to work properly, it being understood that that process must be automated inasmuch as it deletes such an act automatically, without human intervention, once its function of enabling the completion of such a process has come to an end”.[3] In turn, copies are incidental “if they neither exist independently of, nor have a purpose independent of, the technological process of which they form part”.[4]

  • Copies must be an integral part of a technological process. They must be made entirely in the context of implementing a technological process and be necessary for the process to function correctly and efficiently.[5]

  • The reproduction's sole purpose must be either to enable a transmission in a network between third parties by an intermediary or a lawful use of the protected material. The use is considered lawful if the rightholder has authorised it or if the use is not prohibited by law.[6]

  • Copies cannot have an independent economic significance. This means that the economic advantage resulting from their performance must not be distinct or separable from that derived from the lawful use of the material in question.[7] Furthermore, the reproduction process must not alter the material.[8]

Copies made during data pre-processing are usually stored only while the various cleaning, formatting and other transformations are carried out, after which they are automatically deleted. Nevertheless, these operations modify certain aspects of the materials used, which are ultimately represented as tensors. This can cause problems with respect to the fifth requirement, which presupposes that the reproduction process does not alter the material.
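To make the lifecycle of these copies more concrete, here is a minimal sketch in Python of a hypothetical pre-processing step (it is not taken from any real pipeline): a raw copy of a text exists only while it is cleaned and converted into a tensor of token ids and is deleted immediately afterwards, so that only the numerical representation survives.

```python
# Hypothetical pre-processing sketch: the temporary copy of the source material
# exists only while it is cleaned and turned into a tensor, then it is deleted.
import os
import re
import tempfile

import numpy as np


def preprocess_to_tensor(raw_text: str, vocabulary: dict[str, int]) -> np.ndarray:
    """Clean the text and map it to a tensor of token ids (unknown words map to 0)."""
    cleaned = re.sub(r"[^a-z\s]", "", raw_text.lower())        # "cleaning/formatting"
    token_ids = [vocabulary.get(token, 0) for token in cleaned.split()]
    return np.array(token_ids, dtype=np.int64)                  # the tensor representation


vocab = {"copyright": 1, "exception": 2, "training": 3}

# The temporary copy of the material lives only inside this block ...
with tempfile.NamedTemporaryFile("w+", suffix=".txt", delete=False) as tmp:
    tmp.write("Copyright exception for training!")              # the act of reproduction
    tmp.seek(0)
    tensor = preprocess_to_tensor(tmp.read(), vocab)

os.remove(tmp.name)                                             # ... and is then deleted
print(tensor)                                                   # [1 2 0 3]
```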

The Hamburg Regional Court has issued a ruling on this topic. The LAION dataset has some rather special properties: it does not store copies of the pre-processed images, but rather provides “lists of URLs to the original images together with the ALT texts found linked to those images”.[9] Still, copies were created when the Common Crawl web pages were filtered and image-text pairs were downloaded to create the dataset. According to LAION, they “downloaded and calculated CLIP embeddings of the pictures to compute similarity scores between pictures and texts” and then “discarded all the photos”.[10] Based on these facts, I have argued in my doctoral thesis that the aforementioned requirements would be met. Nonetheless, the Hamburg Regional Court reached a different conclusion, ruling that the copies created during the creation of LAION were neither transient nor incidental. This is because “the deletion was not carried out independently of the user, but rather due to the defendant's deliberate programming of the analysis process”[11], and, since the images were downloaded for analysis by specific software, downloading them is “not just a process that accompanies the analysis, but a conscious and actively controlled acquisition process that precedes the analysis”.[12] I would like to mention in passing that LAION could be committing acts of communication to the public by offering links to protected works. I won't dwell on this issue here, as it could be the subject of a separate post (or even a doctoral thesis!). In the meantime, I recommend that readers take a look at Svensson[13] and GS Media[14] to start considering whether this would be the case.
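The operations LAION describes can be pictured with a rough sketch along the following lines. It is not LAION's actual code; the CLIP checkpoint, the use of the Hugging Face transformers API and the similarity threshold are my own illustrative choices. The point is the lifecycle: the downloaded image is a temporary copy used to compute a similarity score, and only the URL, the ALT text and the score are kept.

```python
# Rough, hypothetical sketch of CLIP-based image-text filtering (not LAION's code).
import io

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def score_pair(image_url: str, alt_text: str, threshold: float = 0.28) -> dict | None:
    """Keep only the metadata of pairs whose CLIP similarity clears the threshold."""
    raw_bytes = requests.get(image_url, timeout=10).content   # temporary copy of the image
    image = Image.open(io.BytesIO(raw_bytes)).convert("RGB")

    inputs = processor(text=[alt_text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        output = model(**inputs)
        similarity = torch.nn.functional.cosine_similarity(
            output.image_embeds, output.text_embeds
        ).item()

    del raw_bytes, image                                       # the copy is discarded here
    if similarity < threshold:
        return None
    return {"url": image_url, "caption": alt_text, "clip_score": similarity}
```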

Then, during the processing of the data, temporary copies are produced for the training iterations. These are generally deleted automatically once the process is finished or interrupted. Accordingly, they can be considered transient and incidental. Once the model has been trained, it should not contain copies of the training data unless overfitting or memorisation occurred during training. Nevertheless, difficulties may arise in meeting the fourth and fifth requirements. This is because the copies must enable a lawful use. It would therefore be necessary to observe the terms regulating access to and use of the protected materials, ensuring that TDM or ML training is not among the prohibited uses. Additionally, there is still debate about whether the prohibition on copies having “independent economic significance” should be interpreted narrowly or broadly. I defend the former approach on the basis that it is the deployment of ML models that produces economic benefits, not the temporary copies themselves.
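For readers who want to picture what these iteration-level copies look like, the following is a simplified, assumed training loop (not any provider's actual code): each batch is held in memory only for the duration of one optimisation step, and what persists at the end is the trained model's parameters rather than the training data itself.

```python
# Simplified, assumed training loop: batch tensors are transient; only weights persist.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)

model = torch.nn.Linear(16, 2)
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for features, labels in loader:            # transient per-iteration copies of the batch
    optimiser.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimiser.step()
    # each batch is overwritten on the next iteration and released when the run ends

torch.save(model.state_dict(), "model.pt")  # what is retained: the parameters, not the data
```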

In certain cases, I do believe that this exception could be useful. This is confirmed by Recital 9 DSM Directive, which states that “there can also be instances of text and data mining that do not involve acts of reproduction or where the reproductions made fall under the mandatory exception for temporary acts of reproduction provided for in Article 5(1) of Directive 2001/29/EC, which should continue to apply to text and data mining techniques that do not involve the making of copies beyond the scope of that exception”. However, I would not advise relying on it exclusively for ML projects, given its limited scope and the lack of current information on how this exception will be interpreted by the courts in the scenario at hand.

Text and data mining exceptions (or limitations)

Before analysing the conditions of these exceptions, it is necessary to establish whether ML training, particularly the training of models that power generative AI systems, falls within their material scope. Under Art. 2(2) DSM Directive, TDM is defined as “any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations”. Thus, TDM encompasses techniques other than ML, although, given the broad reach of the definition, ML should, in principle, be covered. That said, the case of generative AI seems to raise doubts, and some argue that training the models that power generative AI systems goes beyond TDM.[15] In my opinion, however, the AI Act[16] clarifies this issue, since it contains obligations for general-purpose AI (GPAI) model[17] providers to implement a policy to comply with EU copyright and related rights law, including the terms of Art. 4 DSM Directive. Again, contrary opinions can be found.[18] While I disagree with them, I would like to take this opportunity to raise the question of whether the AI Act is the appropriate regulation for addressing this issue.

Turning to the substance of the exceptions, Arts. 3 and 4 DSM Directive both stipulate that there must be “lawful access” to the protected material. This means that training on pirated copies obtained from shadow libraries cannot be covered by these exceptions in the EU. The personal scope of the exceptions differs, however. Art. 3 DSM Directive provides an exception for reproductions “made by research organisations and cultural heritage institutions for the purposes of scientific research”. There is some uncertainty surrounding the extent of this exception and the definition of “scientific research purpose”. For example, in the aforementioned LAION case, it was debated whether the creation of datasets by this “non-profit” research organisation, datasets that could subsequently be used by commercial actors, would be covered by Art. 3 DSM Directive. According to the Hamburg Regional Court, the answer is affirmative.[19] The Court also emphasised that, in the present case, only the applicability of the exception to the creation of the dataset was raised, not its subsequent use for training ML models, which would be a separate issue.[20] In any case, when companies use works or other protected subject matter for commercial purposes, they must refer to Art. 4 DSM Directive. This exception is conditional on rightholders not having “opted out”, i.e. not having expressly reserved their rights in an appropriate manner. Recital 18 complements Art. 4 DSM Directive, stating that “this exception or limitation should only apply where the work or other subject matter is accessed lawfully by the beneficiary, including when it has been made available to the public online, and insofar as the rightholders have not reserved in an appropriate manner the rights to make reproductions and extractions for text and data mining. In the case of content that has been made publicly available online, it should only be considered appropriate to reserve those rights by the use of machine-readable means, including metadata and terms and conditions of a website or a service. Other uses should not be affected by the reservation of rights for the purposes of text and data mining. In other cases, it can be appropriate to reserve the rights by other means, such as contractual agreements or a unilateral declaration. Rightholders should be able to apply measures to ensure that their reservations in this regard are respected”.

This opt-out mechanism is causing headaches all round. Firstly, the DSM Directive does not specify who can exercise it. While I argue that it should be the rightholders who opt out or expressly authorise another entity to do so on their behalf — such as large content aggregators, collective management organisations (CMOs), or independent entities — this is a highly debated issue. In this regard, it is unclear whether the opt-out can be exercised by those to whom the rightholder has delegated the management of the right of reproduction, or even via implied authority through agency principles and the duties of licensees.[21] Once again, the Hamburg Regional Court differs from my view, ruling that the holder of the rights to a photograph can invoke the rights reservation made on its website by a photo agency to which use and sublicensing rights had been granted.[22] Several CMOs have also opted out on behalf of the works in their repertoire via unilateral declarations[23], and some CMOs' membership authorisations explicitly request that members grant them the authority to declare an opt-out.[24]

Secondly, the meaning of the term “machine-readable” is uncertain, as is the implementation of the reservation of rights. Should it be implemented for each work, for the entire repertoire, or individually for each ML project? This complicates matters for both rightholders and AI developers. The DSM Directive does not define the term “machine-readable format”, but Art. 1(2) Directive 2013/37/EU (now repealed by Directive (EU) 2019/1024) defined it as “a file format structured so that software applications can easily identify, recognize and extract specific data, including individual statements of fact, and their internal structure”. Nonetheless, the Hamburg Regional Court disregarded this definition when ruling on Art. 4 DSM Directive, holding that a declaration in natural language on a website covering all its photos was “machine-readable”, as the defendant had the technology to recognise such a reservation.[25] Conversely, the Amsterdam District Court adopted a more restrictive stance, ruling that “the Claimants had failed to prove that any TDM on their websites was explicitly reserved in machine-readable means, because the prohibition on automated searches which had been included in their evidence only excluded specific AI bots such as GPTBot, ChatGPT-User, CCBOT, and anthropic-ai”.[26] Disputes are likely to continue until the term “machine-readable format” is clarified by the CJEU.
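To illustrate the practical difference the Amsterdam District Court was pointing at, here is a small Python sketch using the standard library's robots.txt parser. The file content and crawler names are hypothetical examples: a robots.txt that only names specific bots says nothing to a crawler that is not on the list, which is precisely why generic, protocol-level reservations are being discussed.

```python
# Hypothetical robots.txt check: named crawlers are blocked, unnamed ones are not.
from urllib import robotparser

robots_txt_lines = """
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(robots_txt_lines)

for crawler in ("GPTBot", "CCBot", "SomeOtherAIBot"):
    allowed = parser.can_fetch(crawler, "https://example.com/article.html")
    print(f"{crawler}: {'may crawl' if allowed else 'blocked'}")
# GPTBot: blocked, CCBot: blocked, SomeOtherAIBot: may crawl
```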

Art. 53(1)(c) AI Act obliges GPAI model providers to implement a policy to comply with Union law on copyright and related rights, “and in particular to identify and comply with, including through state-of-the-art technologies, a reservation of rights”. On this point, I am still not convinced that the AI Act is the most suitable instrument for promoting compliance with Art. 4 DSM Directive. Complex dynamics that are difficult to manage could arise, but this is the situation the industry must work with. In this respect, one must remember that copyright protection is national. However, according to Art. 2(1)(a), the AI Act applies to providers “irrespective of whether those providers are established or located within the Union or in a third country”. What's more, in accordance with Recital 106 AI Act, the disclosure and transparency obligations of Art. 53 apply once a model is placed on the EU market “regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of those GPAI models take place”. It remains to be seen how this will be applied in practice when GPAI models are trained outside the EU. Whatever the case, it should also be noted that this rule has a fairly limited personal scope, as it only applies to providers, defined in Art. 3(3) AI Act as “a natural or legal person, public authority, agency or other body that develops an AI system or a general-purpose AI model or that has an AI system or a general-purpose AI model developed and places it on the market or puts the AI system into service under its own name or trademark, whether for payment or free of charge”. Therefore, entities solely responsible for web scraping and crawling (although the commitments of the GPAI Code of Practice also apply when signatories have web crawlers used on their behalf, as we will see below[27]), as well as providers of datasets and of other types of AI models that are not GPAI, would not be obliged to implement the aforementioned policy. Now, I would like to emphasise that I consider it appropriate that the debate on copyright infringement in ML training and deployment focuses on generative AI models. When models are trained using protected materials to produce creative content, the outputs themselves may not infringe IP rights, but the models are fed the expressive elements of those materials. Nevertheless, many models are trained using works or materials protected by related rights as mere data. Being very strict about IP rights here could be very costly for innovation. Therefore, I believe it is necessary to make legal distinctions according to the application of the model or AI system in question.

Returning to the subject at hand, the GPAI Code of Practice, which was drawn up in accordance with Art. 56 AI Act, contains a section that is intended to contribute to the proper application of this obligation. To date, 27 companies have adhered to the Code.[28] First of all, the Code provides two measures to “help to ensure” that signatories only reproduce and extract lawfully accessible content when collecting data via web crawlers. The first measure is a commitment not to circumvent effective technological measures, as defined in Art. 6(3) InfoSoc Directive.[29] This is nothing new, but rather a legal obligation which, in fact, is reinforced by Recital 7 DSM Directive, which states that “the protection of technological measures established in Directive 2001/29/EC remains essential to ensure the protection and the effective exercise of the rights granted to authors and to other rightholders under Union law. Such protection should be maintained while ensuring that the use of technological measures does not prevent the enjoyment of the exceptions and limitations provided for in this Directive. Rightholders should have the opportunity to ensure that through voluntary measures. They should remain free to choose the appropriate means of enabling the beneficiaries of the exceptions and limitations provided for in this Directive to benefit from them. In the absence of voluntary measures, Member States should take appropriate measures in accordance with the first subparagraph of Article 6(4) of Directive 2001/29/EC, including where works and other subject matter are made available to the public through on-demand services”. The second measure is that signatories commit to excluding from their web crawling websites that, at the time of crawling, have been recognised by courts or public authorities in the EU and the European Economic Area (EEA) as persistently and repeatedly making infringing content available on a commercial scale.[30] To enable compliance with this measure, the Code stipulates that a list of such websites, with hyperlinks, must be made available on an EU website.[31]

Subsequently, the Code provides for measures to ensure that signatories identify and comply with machine-readable reservations of rights when using web crawlers, including through state-of-the-art technologies. The first measure is that they commit to using web crawlers that can read and follow instructions expressed in accordance with the Robot Exclusion Protocol (robots.txt).[32] The second measure is that signatories commit to identifying and complying with “other appropriate machine-readable protocols” that express rights reservations and have been adopted by international or European standardisation organisations, or that are state-of-the-art. In this last regard, the protocols must be technically implementable and widely adopted by rightholders, taking into account the differences between the various sectors, and be “generally agreed through an inclusive process based on bona fide discussions to be facilitated at the EU level with the involvement of rightholders, AI providers and other relevant stakeholders”.[33] In this sense, I contend that the Code provides an adequate solution. For the opt-out mechanism to work in practice, different stakeholders must share an understanding of how it should be implemented. Furthermore, the Code stipulates that this may be an “interim solution” while standards are developed, to which signatories must contribute voluntarily and in good faith.[34] Various protocols currently exist for implementing rights reservations, including the TDM Reservation Protocol (TDMRep)[35], C2PA[36] and the ai.txt Protocol[37]. The EUIPO report “The Development of Generative Artificial Intelligence from a Copyright Perspective” analyses the pros and cons of each.[38] Needless to say, protocols must be implementable by rightholders. From this perspective, it is problematic if each AI developer promotes a different method of rights reservation on its respective platform. At the same time, to make this requirement manageable for AI developers — who are not necessarily big tech companies — it is also undesirable to expect them to observe an unlimited number of protocols every time they compile data, as this would make the task unmanageable. The correct path is therefore standardisation. Additionally, the EC has commissioned a study to explore the feasibility of a central registry of opt-outs under the TDM exception.[39]

The Code also clarifies that these commitments do not affect the application of copyright and related rights law when signatories use content collected from the internet by third parties.[40] Further, it provides transparency measures. Signatories must therefore make public information about the crawlers they use, their robots.txt features and the other measures adopted to identify and comply with rights reservations.[41] Additionally, signatories must provide a point of contact for communicating with rightholders and adopt a mechanism through which they can submit non-compliance complaints.[42] Finally, the Code stipulates that signatories must take measures to mitigate the risk of downstream AI systems generating infringing outputs. This can be achieved by adopting technical safeguards and by prohibiting the use of the model for copyright-infringing purposes in their acceptable use policies.[43]
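As an illustration of how one of the protocols mentioned above could be honoured in practice, the following Python sketch checks a URL for a TDMRep-style reservation, first via the “tdm-reservation” HTTP header and then via the site's /.well-known/tdmrep.json file. It reflects my reading of the community report cited in note [35] (property names, values and file location should be verified against that report) and simplifies the location-matching rules; it is not an official reference implementation.

```python
# Sketch of a TDMRep-style reservation check (my reading of the community report;
# simplified matching, not a reference implementation).
from urllib.parse import urlparse

import requests


def tdm_reserved(url: str) -> bool:
    """Return True if a TDM rights reservation is declared for this URL."""
    response = requests.head(url, timeout=10, allow_redirects=True)
    header = response.headers.get("tdm-reservation")
    if header is not None:
        return header.strip() == "1"

    # Fall back to the site-wide well-known file.
    parts = urlparse(url)
    well_known = f"{parts.scheme}://{parts.netloc}/.well-known/tdmrep.json"
    wk_response = requests.get(well_known, timeout=10)
    if wk_response.status_code != 200:
        return False                      # no reservation found by these two means
    for rule in wk_response.json():
        # each rule maps a path pattern ("location") to a reservation flag
        location = str(rule.get("location", "")).rstrip("*")   # crude wildcard handling
        if parts.path.startswith("/" + location.lstrip("/")):
            return rule.get("tdm-reservation") == 1
    return False


print(tdm_reserved("https://example.com/articles/some-work.html"))
```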

It should be highlighted at this point that transparency is a key issue in this debate, given that AI developers tend not to disclose the content of training datasets. This makes it challenging for rightholders to demonstrate infringement of their rights. For this reason, Article 53(1)(d) AI Act requires GPAI model providers to “draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office”. Recital 107 AI Act clarifies that “this summary should generally be comprehensive in its scope instead of technically detailed to facilitate parties with legitimate interests, including copyright holders, to exercise and enforce their rights under Union law, for example by listing the main data collections or sets that went into training the model, such as large private or public databases or data archives, and by providing a narrative explanation about other data sources used”. All this “taking into account the need to protect trade secrets and confidential business information”. The Recital goes on to say that the AI Office will provide a template for the summary, which “should be simple, effective, and allow the provider to provide the required summary in narrative form”. The template was published on 24 July 2025. It requires GPAI model providers to disclose the data used at every stage of model training, as well as the sources and types of data used, whether protected or not.[44] In general terms, the summary comprises three sections. The first contains general information, including the characteristics of the training data and a description of the content included.[45] The second contains a list of data sources: providers disclose the main datasets used and provide a narrative description of data scraped and other data sources. In this regard, information must be provided on publicly available datasets, private datasets obtained from third parties, data crawled and scraped from online sources, user data, synthetic data, and other sources. Regarding data scraped from online sources, the EC requires GPAI model providers to list “the top 10% of all domain names determined by the size of the content scraped”. SMEs should “disclose the top 5% of all domain names or 1000 internet domain names, whichever is lower”.[46] Additionally, the EC recommends that, for domain names not listed, GPAI model providers act in good faith and voluntarily enable parties with a legitimate interest — including rightholders — to obtain information on whether the provider has scraped and used content, including protected works and other subject matter that rightholders have made available on specific internet domains.[47] The EC warns that this is without prejudice to the mechanisms that rightholders may use in accordance with Art. 8 IPR Enforcement Directive.[48] Lastly, the third section contains information on relevant aspects of data processing to enable rightholders to exercise their rights. This includes details of the measures implemented to identify and comply with reservations of rights and the measures taken to avoid or remove illegal content from the training data.[49] Overall, rightholders will gain insight into whether the sources used were of lawful or unlawful origin. Combining this obligation with technical tools that help detect whether works have been used, such as Have I Been Trained[50] and Kudurru.ai[51], enhances rightholders' ability to provide evidence.
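To make the disclosure threshold for scraped data concrete, here is a small sketch of the arithmetic as I read it (the rounding rule is my own assumption, and the figures are invented): domains are ranked by the volume of content scraped, and the top 10% of domain names are listed, or, for SMEs, the top 5% or 1,000 domains, whichever is lower.

```python
# Sketch of the "top 10% of domain names by size of content scraped" arithmetic.
# Rounding up is an assumption; figures are invented.
import math


def domains_to_disclose(scraped_bytes_per_domain: dict[str, int], sme: bool = False) -> list[str]:
    ranked = sorted(scraped_bytes_per_domain, key=scraped_bytes_per_domain.get, reverse=True)
    if sme:
        cutoff = min(math.ceil(0.05 * len(ranked)), 1000)
    else:
        cutoff = math.ceil(0.10 * len(ranked))
    return ranked[:cutoff]


example = {
    "news-site.example": 120_000_000,
    "forum.example": 90_000_000,
    "archive.example": 60_000_000,
    "blog.example": 4_000_000,
    "shop.example": 1_000_000,
}
print(domains_to_disclose(example))         # top 10% of 5 domains -> ['news-site.example']
```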

Now, suppose that a particular work has been used to train a model and the rightholder later exercises the opt-out mechanism. In this case, it is worth asking whether this reservation of rights would have retroactive effects. In my opinion, for legal certainty, the answer should be no. Furthermore, once a dataset containing protected material is in circulation (which may be illegal, but still happens), it will be difficult to effectively exercise the reservation of rights for every use.

Another notable aspect of the TDM exceptions is that, while Art. 3 DSM Directive states that copies of protected material “shall be stored with an appropriate level of security and may be retained for the purposes of scientific research, including for the verification of research results”, Art. 4 dictates that copies “may be retained for as long as is necessary for the purposes of text and data mining”. Strict application of the latter provision could cause problems if copies need to be kept for compliance purposes, to identify and correct possible biases in training datasets, or to monitor possible failures in the model's operation.

Overall, the TDM exception set out in Art. 4 DSM Directive has increased search and transaction costs, making life more complicated for both rightholders and AI developers. Given the large amount of data required for ML training, individual licences are unworkable, whereas collective management offering a “one-stop shop” is much more practical. While some AI developers have managed to strike deals with major corporations, it remains to be seen whether individual rightholders will benefit from these deals.[52] Moreover, although not widespread, there are voluntary collective licensing initiatives.[53] As previously mentioned, some CMOs have reserved the rights over the works in their repertoire and are developing methods to promote collective licensing. In Spain, the Ministry of Culture promoted a draft Royal Decree to regulate the granting of collective licences with extended effect for the mass exploitation of IP-protected content for training GPAI models.[54] However, it was criticised and did not go ahead. I am not against the promotion of collective management by CMOs, but some of the proposed initiatives do not align well with the content of the DSM Directive. Given this situation, the question of whether it would be more appropriate to replace the current legal regime with a compulsory licence scheme managed by CMOs is much debated. Under such a scheme, AI developers would have access to data with which to train their models, and rightholders would receive remuneration.[55] That said, several things must be taken into account. Firstly, CMOs operate on a national basis, extending their reach extraterritorially only through complex representation agreements. Yet multi-territorial and, above all, multi-repertoire licences are not a reality for many types of uses of works and protected content. Secondly, maintaining the repository and providing content in appropriate formats is complex. Thirdly, there is the issue of establishing the criteria for calculating the fair remuneration due to each rightholder. In this regard, it should be noted that Art. 16(2) Directive 2014/26/EU states that the tariffs set by CMOs “shall be reasonable in relation to, inter alia, the economic value of the use of the rights in trade, taking into account the nature and scope of the use of the work and other subject-matter, as well as in relation to the economic value of the service provided by the collective management organisation”. The use of protected content for ML training is more complex than many of the usage models known to date: it is highly challenging to determine the weight of each input in the functioning of the models and their influence on the outputs. Given these difficulties, proposals have emerged to introduce an output-based levy managed by CMOs.[56] This is certainly a less cumbersome and more practical option, although it raises the issue of justification when the outputs do not infringe any rights. Another approach that has been suggested is to implement a “token-based royalty system grounded in the marginal utility of training data”.[57]

For all the reasons mentioned, I believe that a compulsory collective licensing system may not work well in practice at the moment. As I argued in my thesis, “significant investment in the proper technology to maintain such a system, as well as much discussion with several stakeholders on the methods of calculation and distribution of the remuneration obtained, would be required before it could become a viable solution”.[58] That said, research into attribution and traceability is ongoing. For example, GEMA has a licensing model for AI covering the works in its repertoire, based on two components. The first relates to AI training and comprises a minimum royalty plus a standard royalty of 30% of all net income generated by the generative AI model or system. According to GEMA, this component targets “all providers of generative AI services operating within the German market”. Moreover, “the license is not dependent on where the training takes place; instead, it is linked to the generation of output. This approach acknowledges recent scientific findings, suggesting that the training data remains embedded in the trained models” (in my view, this is a controversial assertion). The second component is based on “the economic benefits that can arise from the subsequent use of AI-generated music content”. In this regard, rightholders must receive “an appropriate share of the additional income generated by AI-produced songs. This share must be at least equivalent to what would have been provided for purely human-generated works”.[59] We will have to wait and see whether there are any complaints about the tariffs in the near future, and how any issues are resolved should they arise. It has also been reported that GEMA is piloting a mechanism, not yet publicly disclosed, to address attribution at scale.[60] However, and I may be mistaken, it does not seem that the current system will be changed to introduce compulsory collective licensing. Therefore, licensing methods will have to be organised in accordance with Art. 4 DSM Directive. Although attempts are being made, this is an arduous task. That said, if CMOs are allowed to exercise the rights of their members without prior express consent and implement a licensing mechanism, the practical result would be much the same as creating compulsory licensing schemes.
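Purely to make the structure of such a two-component model tangible, here is a toy calculation that follows the description above. All figures, the output-share parameter and the simple additive formula are hypothetical illustrations of mine, not GEMA's actual tariff methodology.

```python
# Toy, hypothetical illustration of a two-component AI licence fee (not GEMA's tariff).
def ai_licence_fee(net_income: float, minimum_royalty: float,
                   output_income: float, output_share: float) -> float:
    training_component = minimum_royalty + 0.30 * net_income   # minimum plus 30% of net income
    output_component = output_share * output_income            # share of output-related income
    return training_component + output_component


# e.g. EUR 2m net income, EUR 50k minimum, EUR 300k output-related income, 10% share
print(ai_licence_fee(2_000_000, 50_000, 300_000, 0.10))         # -> 680000.0
```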

In conclusion, the situation is complex. Currently, there is much uncertainty surrounding the implementation of, and respect for, the rights reserved under Art. 4 DSM Directive. Once this issue has been clarified, it will be necessary to discuss how to organise and calculate the remuneration of rightholders who wish to license their works or material protected by related rights. In this regard, one must be practical to avoid stifling innovation, bearing in mind that high search and transaction costs could result in only a select few being able to develop generative AI. At the same time, it must be ensured that rightholders effectively receive fair and adequate remuneration for the use of their works in these projects, without favouring the works of certain rightholders over others, and while monitoring that the remuneration actually reaches individual rightholders (although I fear that, while it may be substantial in aggregate, it will be scarce per work or per protected material). For the time being, it seems that this will be a protracted issue, with much still to be discussed.

Litigation in the European Union

Although the spotlight has recently been on the numerous generative AI lawsuits filed in the US, cases have also begun to emerge in the courts of EU Member States. For instance, in November 2024, GEMA sued OpenAI at the Munich Regional Court for using lyrics by nine well-known German composers in its repertoire to train ChatGPT without authorisation. GEMA alleged that ChatGPT memorises the lyrics it was trained on and can reproduce them when prompted.[61] Following the oral hearing on 29 September 2025, the Court found it to be undisputed that the large language model was trained using the nine lyrics in question. Then, on 11 November 2025, the Court held that both the memorisation of lyrics in language models and their reproduction in chatbot outputs constitute infringements of copyright. Furthermore, these infringements are not covered by any exceptions, particularly that relating to TDM.[62] In January 2025, GEMA filed another complaint against Suno, again before the Munich Regional Court, concerning the training of generative AI systems that produce audio recordings using songs from its repertoire, for which authors have not received remuneration. GEMA also pointed out that some of these outputs infringe the copyright of the rightholders it represents.[63] Similarly, in March 2025, the Syndicat national de l’édition (SNE – French Publishers' Association), the Société des Gens de Lettres (SGDL – French Society of Writers), and the Syndicat national des auteurs et des compositeurs (SNAC – French Authors' and Composers' Association) sued Meta before the Paris Judicial Court for training generative AI models with protected content without permission. They alleged copyright infringement and economic free-riding.[64]

A preliminary ruling requested by the Budapest Environs Regional Court is also awaiting a response from the CJEU that could clarify some of the issues raised in this post. The facts of the case are as follows: a Hungarian publisher and news portal operator sued Google for its conduct relating to the development and deployment of Gemini — specifically, its generative AI chatbot functions. According to the plaintiff, when prompted, Gemini provides detailed information about the content of the plaintiff's press publications, sometimes creates summaries, and, when it quotes a longer passage from a website, identifies the source and enables the user to access it with a click. It is also reported that Gemini uses the Google Search database to collect data.[65] The plaintiff further alleges that the use of protected works during Gemini's training infringed its reproduction rights, and that the chatbot's outputs constitute further instances of reproduction and a communication to the public. In this regard, it should be noted that, in accordance with Art. 15 DSM Directive, publishers of press publications have the right of reproduction contained in Art. 2 InfoSoc Directive, as well as the right of communication to the public set out in Art. 3(2) InfoSoc Directive, for the online use of their publications by information society service providers. However, the protection does not cover acts of hyperlinking and does not apply “in respect of the use of individual words or very short extracts of a press publication”. In the plaintiff's view, bearing in mind the legislative objective of Art. 15 DSM Directive, it should be concluded that displaying protected content beyond mere reference harms the economic interests of publishers.[66] Google, for its part, maintains that the aforementioned acts do not constitute reproductions or communication to the public. Regarding the latter, Google argues that the chatbot's responses do not reach a new audience; rather, the audience is all internet users, who would be the same as those with access to the protected content. Google also claims that the displayed content does not exceed individual words or very short extracts. In any case, Google contends that any reproductions would be covered by the exceptions set out in Art. 5(1) InfoSoc Directive and Art. 4 DSM Directive.[67] Given these facts, the four questions referred are reproduced below:

“1. Must Article 15(1) of Directive (EU) 2019/790 and Article 3(2) of Directive 2001/29/EC be interpreted as meaning that the display, in the responses of an LLM-based chatbot, of a text partially identical to the content of web pages of press publishers, where the length of that text is such that it is already protected under Article 15 of Directive 2019/790, constitutes an instance of communication to the public? If the answer to that question is in the affirmative, does the fact that [the responses in question are] the result of a process in which the chatbot merely predicts the next word on the basis of observed patterns have any relevance?

2. Must Article 15(1) of Directive 2019/790 and Article 2 of Directive 2001/29 be interpreted as meaning that the process of training an LLM-based chatbot constitutes an instance of reproduction, where that LLM is built on the basis of the observation and matching of patterns, making it possible for the model to learn to recognise linguistic patterns?

3. If the answer to the second question referred is in the affirmative, does such reproduction of lawfully accessible works fall within the exception provided for in Article 4 of Directive 2019/790, which ensures free use for the purposes of text and data mining?

4. Must Article 15(1) of Directive 2019/790 and Article 2 of Directive 2001/29 be interpreted as meaning that, where a user gives an LLM-based chatbot an instruction which matches the text contained in a press publication, or which refers to that text, and the chatbot then generates its response based on the instruction given by the user, the fact that, in that response, part or all of the content of a press publication is displayed constitutes an instance of reproduction on the part of the chatbot service provider?”.[68]
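The technical premise underlying the first and second questions, that the chatbot “merely predicts the next word on the basis of observed patterns”, can be illustrated with a toy sketch. The bigram model below is my own, deliberately trivial example; real large language models learn statistical patterns across billions of parameters, but the basic idea of emitting the most likely continuation of a prompt is the same.

```python
# Toy bigram "language model": it learns which word follows which in its training
# text and, given a prompt, keeps emitting the most frequent observed continuation.
from collections import Counter, defaultdict

training_text = (
    "the press publisher reported the news and the press publisher reported the story"
)

counts: dict[str, Counter] = defaultdict(Counter)
words = training_text.split()
for current_word, next_word in zip(words, words[1:]):
    counts[current_word][next_word] += 1          # the "observed patterns"


def generate(prompt: str, length: int = 5) -> str:
    output = prompt.split()
    for _ in range(length):
        candidates = counts.get(output[-1])
        if not candidates:
            break
        output.append(candidates.most_common(1)[0][0])   # predict the next word
    return " ".join(output)


print(generate("the press"))   # -> "the press publisher reported the press publisher"
```

Even in this toy setting, a frequent training sequence is reproduced verbatim in the output, which is, in essence, the memorisation issue at the heart of the GEMA v OpenAI ruling mentioned above.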

I am sure we will all be eagerly awaiting the CJEU's response! I promise that I will write another post addressing the issue of whether outputs created using generative AI systems may infringe the right of communication to the public. In the meantime, if you would like to explore the topics discussed in this post in greater depth, I recommend some further reading:

  • Brando, Axel: `Technological Aspects of Generative AI in the Context of Copyright. Attribution and Novelty in Generative AI Hypersurfaces´ (2025) <https://www.europarl.europa.eu/thinktank/en/document/IUST_BRI(2025)776529>.

  • Espitia Restrepo, Santiago / Montesdeoca, Becky: `Munich AI ruling could reshape EU interpretation of copyright´ (2025) World Trademark Review <https://www.4ipcouncil.com/research/munich-ai-ruling-could-reshape-eu-interpretation-copyright>

  • European Parliament, Committee on Legal Affairs: Amendments 1 – 370, Draft report Axel Voss (PE775.433v01-00) Copyright and generative artificial intelligence – opportunities and challenges (2025/2058(INI)), 16.9.2025 <https://www.europarl.europa.eu/doceo/document/JURI-AM-777015_EN.pdf>.

  • European Union Intellectual Property Office: `Development of Generative Artificial Intelligence from a Copyright Perspective´ (2025) <https://www.euipo.europa.eu/en/publications/genai-from-a-copyright-perspective-2025>.

  • Geiger, Christophe / Iaia, Vincenzo: `The Forgotten Creator: Towards a Statutory Remuneration Right for Machine Learning of Generative AI´ (2024) 52 Computer Law & Security Review.

  • González Otero, Begoña: `Las excepciones de minería de textos y datos más allá de los derechos de autor: La ordenación privada contraataca´ (2019) <https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3477197>.

  • Guadamuz, Andrés: `A Scanner Darkly: Copyright Infringement in Artificial Intelligence Inputs and Outputs´ (2024) 73 (2) GRUR International 111, 127.

  • Guadamuz, Andrés: `First case on AI and copyright referred to the CJEU´ (TechnoLlama, May 27, 2025) <https://www.technollama.co.uk/revisiting-copyright-infringement-in-ai-inputs-and-outputs>.

  • Guadamuz, Andrés: `How AI is breaking traditional remuneration models´ (TechnoLlama, July 9, 2025) <https://www.technollama.co.uk/revisiting-copyright-infringement-in-ai-inputs-and-outputs>.

  • Guadamuz, Andrés: `Revisiting copyright infringement in AI inputs and outputs´ (TechnoLlama, July 30, 2025) <https://www.technollama.co.uk/revisiting-copyright-infringement-in-ai-inputs-and-outputs>.

  • Hoffmann, Jörg: `Technological Determination of AI-Relevant Press and Copyright Law and Generative Content's Relevance for EU Competition Law – The referral in Case C-250/25, Like Company v. Google Ireland Ltd.´ (2025) <https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5411443>.

  • Jiménez Serranía, Vanessa: `Datos, minería e innovación: qvo vadis, Europa? Análisis sobre las nuevas excepciones para la minería de textos y datos´ (2020) 12(1) Cuadernos de Derecho Transnacional 247, 258.

  • Lucchi, Nicola: `Generative AI and Copyright, Training, Creation, Regulation´ (2025) <https://www.europarl.europa.eu/thinktank/en/document/IUST_STU(2025)774095>.

  • Margoni, Thomas / Kretschmer, Martin: `A Deeper Look into the EU Text and Data Mining Exceptions: Harmonisation, Data Ownership, and the Future of Technology´ (2022) GRUR International 685, 701.

  • Nadal Cebrián, Marta: `Pueden los sistemas de inteligencia artificial «aprender» utilizando obras protegidas por derechos de propiedad intelectual sin autorización de sus titulares. Los retos del Machine Learning en perspectiva comparada: UE, EEUU y Japón´ (2021) 68 Pe. i.: Revista de propiedad intelectual 15, 74.

  • Sánchez Aristi, Rafael / Bourkaib, Álvaro: `Informe sobre el posible establecimiento de un mecanismo de licencia colectiva con efecto ampliado, o de remuneración o compensación equitativa de gestión colectiva obligatoria, vinculado a actividades de minería de textos y datos efectuadas bajo la excepción o limitación del artículo 4 de la Directiva 2019/790´ (2024) 77 Pe. i.: Revista de propiedad intelectual 71, 114

  • Senftleben, Martin: `Generative AI and Author Remuneration´ (2023) 54 International Review of Intellectual Property and Competition Law 1535, 1560.


[1] Information Society Directive (Directive 2001/29/EC).

[2] Directive on Copyright in the Digital Single Market (Directive (EU) 2019/790).

[3] C‐5/08, Infopaq International A/S v Danske Dagblades Forening (2009) ECLI:EU:C:2009:465, para. 64.

[4] C‐360/13, Public Relations Consultants Association Ltd v Newspaper Licensing Agency Ltd and Others (2014) ECLI:EU:C:2014:1195, para. 43.

[5] C‐302/10, Infopaq International A/S v Danske Dagblades Forening (2012) ECLI:EU:C:2012:16, para. 30, 31.

[6] Recital 33 InfoSoc Directive; C‐302/10, Infopaq International A/S v Danske Dagblades Forening (2012) ECLI:EU:C:2012:16, para. 42.

[7] C‐302/10, Infopaq International A/S v Danske Dagblades Forening (2012) ECLI:EU:C:2012:16, para. 51.

[8] C‐302/10, Infopaq International A/S v Danske Dagblades Forening (2012) ECLI:EU:C:2012:16, para. 52, 53.

[9] See LAION's FAQ nº1 <https://laion.ai/faq/>.

[10] Ibid.

[11] LG Hamburg, Urteil vom 27.09.2024, Az.: 310 O 227/23, para. 63 (own translation).

[12] LG Hamburg, Urteil vom 27.09.2024, Az.: 310 O 227/23, para. 66 (own translation).

[13] C-466/12, Nils Svensson, Sten Sjögren, Madelaine Sahlman, Pia Gadd v Retriever Sverige AB (2014) ECLI:EU:C:2014:76.

[14] C-160/15, GS Media BV v Sanoma Media Netherlands BV, Playboy Enterprises International Inc., Britt Geertruida Dekker (2016) ECLI:EU:C:2016:644.

[15] See Tim W. Dornis und Sebastian Stober, `Urheberrecht und Training generativer KI-Modelle – technologische und juristische Grundlagen´ (2024) <https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4946214> 87, 97; Nicola Lucchi, `Generative AI and Copyright, Training, Creation, Regulation´ (2025) <https://www.europarl.europa.eu/thinktank/en/document/IUST_STU(2025)774095> 41, 50.

[16] Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828.

[17] Art. 3(63) AI Act, “general-purpose AI model” means “an AI model, including where such an AI model is trained with a large amount of data using self-supervision at scale, that displays significant generality and is capable of competently performing a wide range of distinct tasks regardless of the way the model is placed on the market and that can be integrated into a variety of downstream systems or applications, except AI models that are used for research, development or prototyping activities before they are placed on the market”.

[18] Nicola Lucchi (n. 15) 51, 54.

[19] LG Hamburg, Urteil vom 27.09.2024, Az.: 310 O 227/23, para. 109, 128.

[20] LG Hamburg, Urteil vom 27.09.2024, Az.: 310 O 227/23, para. 72.

[21] European Union Intellectual Property Office (EUIPO), `Development of Generative Artificial Intelligence from a Copyright Perspective´ (2025) <https://www.euipo.europa.eu/en/publications/genai-from-a-copyright-perspective-2025> 48.

[22] LG Hamburg, Urteil vom 27.09.2024, Az.: 310 O 227/23, para. 94, 96.

[23] See inter alia <https://createurs-editeurs.sacem.fr/actualites-agenda/actualites/la-sacem-et-vous/pour-une-intelligence-artificielle-vertueuse-transparente-et-equitable-la-sacem-exerce-son-droit>; <https://www.sabam.be/en/press/sabam-safeguards-rights-its-authors-ai-use>; <https://pictoright.nl/en/general/collectieve-opt-out-pictoright-aangeslotenen/>; <https://www.artisjus.hu/felhasznaloknak/mas-felhasznalas/ai-tdm-objection/#english>; <https://vegap.es/que-es-vegap-legislacion-inteligencia-artificial/>; <https://www.aepo-artis.org/adami-and-spedidam-exercise-the-opt-out-right/>.

[24] See FAQ number 4 <https://www.gema.de/en/news/ai-and-music/ai-lawsuit>.

[25] LG Hamburg, Urteil vom 27.09.2024, Az.: 310 O 227/23, para. 98, 106.

[26] Alessandro Cerri, `Dutch court holds that TDM opt-out must be done by "machine-readable" means´ (The IPKat, 16 February 2025) <https://ipkitten.blogspot.com/2025/02/dutch-court-holds-that-tdm-opt-out-must.html>.

[27] Code of Practice for General-Purpose AI Models, Copyright Chapter, p. 5.

[28] See <https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai>.

[29] Code of Practice for General-Purpose AI Models, Copyright Chapter, p. 4

[30] Code of Practice for General-Purpose AI Models, Copyright Chapter, p. 4, 5.

[31] Ibid.

[32] Code of Practice for General-Purpose AI Models, Copyright Chapter, p. 5

[33] Ibid.

[34] Ibid.

[35] See <https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240510>.

[36] See <https://cawg.io/training-and-data-mining/1.1/>.

[37] See <https://arxiv.org/html/2505.07834v1>.

[38] EUIPO (n. 21) 164, 229.

[39] See <https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/tender-details/8726813a-bd9b-4f58-8679-01c80f7a1abf-CN>.

[40] Code of Practice for General-Purpose AI Models, Copyright Chapter, p. 5

[41] Code of Practice for General-Purpose AI Models, Copyright Chapter, p. 6.

[42] Ibid.

[43] Ibid.

[44] European Commission (EC), Annex to the Communication to the Commission Approval of the content of the draft Communication from the Commission – Explanatory Notice and Template for the Public Summary of Training Content for general-purpose AI models required by Article 53 (1)(d) of Regulation (EU) 2024/1689 (AI Act), Brussels, 24.7.2025 C(2025) 5235 final, 3, 4.

[45] EC (n. 44) 9, 10.

[46] EC (n. 44) 10, 13.

[47] EC (n. 44) 4.

[48] Ibid.

[49] EC (n. 44) 13.

[50] See <https://spawning.ai/have-i-been-trained>.

[51] See <https://kudurru.ai/>.

[52] See, for example, the deals of OpenAI with Axel Springer <https://intellectual-property-helpdesk.ec.europa.eu/news-events/news/openai-partners-news-publisher-axel-springer-meta-faces-claims-authors-infringing-copyright-2023-12-22_en>; OpenAI with Le Monde and Prisa Media <https://openai.com/index/global-news-partnerships-le-monde-and-prisa-media/>; OpenAI with the Financial Times <https://aboutus.ft.com/press_release/openai>; and Google with Reddit <https://blog.google/inside-google/company-announcements/expanded-reddit-partnership/>.

[53] See, for example, Musical AI <https://www.wearemusical.ai/>; Shutterstock <https://www.shutterstock.com/es/data-licensing>; Calliope Networks <https://calliopenetworks.ai/>; Datarade <https://datarade.ai/data-categories/entertainment-data>; Protege <https://www.withprotege.ai/>; Tunecore <https://support.tunecore.com/hc/en-us/articles/18341253558420-TuneCore-s-AI-Data-Protection-Program>; and the Copyright Clearance Center’s RightFind™ XML <https://www.copyright.com/solutions-rightfind-xml/>.

[54] See Proyecto de Real Decreto por el que se regula la concesión de licencias colectivas ampliadas para la explotación masiva de obras y prestaciones protegidas por derechos de propiedad intelectual para el desarrollo de modelos de inteligencia artificial de uso general <https://www.cultura.gob.es/en/dam/jcr:95c986c7-893f-46c6-81d4-3ba822a6696e/proyecto-rd-licencias-colectivas.pdf>.

[55] See Christophe Geiger and Vincenzo Iaia, `The Forgotten Creator: Towards a Statutory Remuneration Right for Machine Learning of Generative AI´ (2024) 52 Computer Law & Security Review.

[56] See Martin Senftleben, `Generative AI and Author Remuneration´ (2023) 54 International Review of Intellectual Property and Competition Law 1535, 1560.

[57] Nicola Lucchi (n. 15) 83.

[58] Marta Duque Lizarralde, Business-to-Business Data Sharing for Artificial Intelligence Development (Nomos, 2025) 110, 111.

[59] GEMA, `Two components - one goal: Music creators shall receive fair shares through effective AI licensing´ (GEMA News, October 17, 2024) <https://www.gema.de/en/w/generative-ai-licensing-model>.

[60] EUIPO (n. 21) 15.

[61] GEMA, `GEMA files model action to clarify AI providers‘ remuneration obligations in Europe´ (Press release, November 13, 2024) <https://www.gema.de/en/w/gema-files-lawsuit-against-openai>.

[62] See `Verhandlung GEMA geg. OpenAI´ (Press release, September 29, 2025) (own translation) <https://www.justiz.bayern.de/gerichte-und-behoerden/landgericht/muenchen-1/presse/2025/9.php>.

[63] GEMA, `Fair remuneration demanded: GEMA files lawsuit against Suno Inc.´ (Press release, January 21, 2025) <https://www.gema.de/en/w/press-release-lawsuit-against-suno>; <https://www.justiz.bayern.de/gerichte-und-behoerden/landgericht/muenchen-1/presse/2025/11.php>.

[64] See Syndicat national de l’édition, `Authors and Publishers Unite in Lawsuit against Meta to Protect Copyright from Infringement by Generative AI Developers´ (Press release, March 18, 2025) <https://www.sne.fr/press-release-authors-and-publishers-unite-in-lawsuit-against-meta-to-protect-copyright-from-infringement-by-generative-ai-developers/>.

[65] Case C-250/25, Summary of the request for a preliminary ruling pursuant to Article 98(1) of The Rules of Procedure of the Court of Justice <https://curia.europa.eu/juris/showPdf.jsf?docid=300681&doclang=EN> 3, 4.

[66] Case C-250/25, Summary of the request (n. 65) 5, 6.

[67] Case C-250/25, Summary of the request (n. 65) 6, 7.

[68] Case C-250/25, Like Company: Request for a preliminary ruling from the Budapest Környéki Törvényszék (Hungary) lodged on 3 April 2025 – Like Company v Google Ireland Limited <https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A62025CN0250>.