An introduction to copyright infringement in machine learning training and in the deployment of generative artificial intelligence
INTELLECTUAL PROPERTY RIGHTS
11/8/2025 · 8 min read
The question of whether works generated with artificial intelligence (AI) can be protected by copyright has been a hot topic of debate for quite some time now (I myself began studying this topic for my bachelor's thesis, and a lot has happened since then!). Recently, however, the debate surrounding the training of machine learning (ML) models using copyright-protected works or materials protected by related rights has gained prominence. This post will therefore explain what this debate is about.
Protected works and conferred rights
ML is a technique that aims to develop pattern-recognition systems that “learn” to make predictions about new data by analysing previous data. Therefore, it is important to clarify what is meant by “data”. According to European Union (EU) legislation, “data” is “any digital representation of acts, facts or information and any compilation of such acts, facts or information, including in the form of sound, visual or audio-visual recording”.[1] Consequently, works protected by copyright or other related rights can be considered data.
Copyright protects “works”, i.e. the formal expression of an idea or feeling communicated to the public. As works are immaterial, only their form and expression are protected, not their tangible medium or their underlying ideas. For subject matter to be classified as a work, it must be “original”[2], namely, be “the author's own intellectual creation”, which “is manifested by the author's free and creative choices”.[3] Creations that may be protected by copyright include original images, texts, sounds or videos.
Copyright holders are granted moral and economic rights. Moral rights are not harmonised, but Art. 6bis of the Berne Convention states that authors may claim the rights of attribution and integrity. The former consists of “claiming authorship of the work”, while the latter involves “objecting to any distortion, mutilation or other modification of, or other derogatory action in relation to, the said work, which would be prejudicial to his honour or reputation”.
The economic rights most relevant to this debate are those of reproduction, communication to the public, and adaptation. The right of reproduction is defined as the exclusive right to “authorise or prohibit direct or indirect, temporary or permanent reproduction by any means and in any form, in whole or in part”.[4] For its part, the right of communication to the public can be exercised in various ways. A particularly important modality for digital environments is that of “making available to the public the works in such a way that members of the public may access them from a place and at a time individually chosen by them”.[5] Lastly, the right of adaptation, which is not harmonised, can be described as the right to prohibit or authorise the modification of a work to create a different one.[6]
Furthermore, many materials that are not eligible for copyright protection may be covered by related rights, which safeguard the interests of investors. These include the rights of phonogram[7] and audiovisual producers[8], performers[9], press publishers[10], and creators of non-original photographs[11]. Typically, holders of related rights do not enjoy moral rights, but only economic rights, such as the rights of reproduction and communication to the public. All of these rights can be assigned to a single entity, although this is not always the case.
Infringement
Training an ML model involves several steps, each of which may affect an exclusive right. Once collected, the data is usually stored for further processing, which results in reproductions. Therefore, where protected content is not freely available, the rightholder must authorise access and any further use. Even where content is freely available, this does not mean that it can be used without restrictions.
The data must then be cleaned, integrated and reformatted. Following this “pre-processing”, the data is transformed into tensors, which are ready for automated processing. Where datasets remain within the company, concerns arise only about the legality of the pre-processing in light of the rights of reproduction and adaptation. However, the right of communication to the public may also be infringed when datasets are shared with others.
Copies of the samples are also created when they are retrieved and processed during training. Retrieval can be performed either in batches or in real time, and each training run iterates over the training data multiple times. These copies are deleted once processing is complete and the trained model has been obtained.
The trained model can then be used independently of the training dataset and, particularly in the field of generative AI, deployed to create new content. This content can occasionally resemble, or even replicate, works contained in the training dataset.
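To make the steps above more tangible, here is a minimal sketch in Python of where copies tend to arise in a training pipeline. It is a toy illustration under my own assumptions, not a description of how any particular company trains its models: the names `CopyLog`, `store`, `preprocess`, `to_tensors`, `train` and `run_pipeline` are hypothetical, the "tensors" are plain lists of character codes, and no actual learning takes place. The only purpose is to count the points at which the data is duplicated or transformed.

```python
from dataclasses import dataclass, field


@dataclass
class CopyLog:
    """Hypothetical helper that records each stage at which data is copied."""
    events: list[str] = field(default_factory=list)

    def record(self, stage: str) -> None:
        self.events.append(stage)


def store(raw_docs: list[str], log: CopyLog) -> list[str]:
    # Storing the collected material creates a first set of copies.
    log.record("storage")
    return list(raw_docs)


def preprocess(docs: list[str], log: CopyLog) -> list[str]:
    # Cleaning and reformatting produces modified copies of the material.
    log.record("pre-processing")
    return [" ".join(d.lower().split()) for d in docs]


def to_tensors(docs: list[str], log: CopyLog) -> list[list[int]]:
    # Stand-in for tensor conversion: each text becomes a list of numbers.
    log.record("tensor conversion")
    return [[ord(ch) for ch in d] for d in docs]


def train(tensors: list[list[int]], epochs: int, batch_size: int, log: CopyLog) -> int:
    # Every epoch iterates over the whole dataset again, batch by batch,
    # so each sample is retrieved (and temporarily copied) several times.
    steps = 0
    for epoch in range(epochs):
        for start in range(0, len(tensors), batch_size):
            _batch = tensors[start:start + batch_size]
            log.record(f"batch retrieval (epoch {epoch + 1})")
            steps += 1
    return steps


def run_pipeline(raw_docs: list[str], epochs: int, batch_size: int) -> tuple[int, CopyLog]:
    log = CopyLog()
    tensors = to_tensors(preprocess(store(raw_docs, log), log), log)
    steps = train(tensors, epochs, batch_size, log)
    return steps, log
```

Calling `run_pipeline(["First  Text", "Second text", "Third"], epochs=2, batch_size=2)` logs three copying events before training and one further event for every batch retrieved. Which of these events amounts to a legally relevant reproduction or adaptation is precisely the question discussed in this post; the code only shows where the copies occur.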
Litigation has surfaced in this field. Several plaintiffs claim that various AI companies infringed their copyright by creating training datasets comprising their unlicensed works, which the companies then used to train ML models. On this view, the models and their outputs would be unlawful copies or derivative works of the input works. Notably, defendants often do not disclose the composition of their training datasets. Nevertheless, the plaintiffs infer that their works have been exploited from online tools and from statements by AI company spokespersons who have acknowledged scraping various websites. Furthermore, when prompted, the models can sometimes generate works in the style of a particular author, provide detailed summaries of books or articles, or reproduce extracts of works. Additionally, some plaintiffs emphasise that studies have shown ML models can “memorise” the inputs. Against this background, plaintiffs allege that AI companies are directly liable for acts carried out during the pre-training and training phases, and secondarily liable for copyright infringements committed by users who generate unlawful outputs by prompting the models.[12]
Exceptions and limitations to copyright and related rights
Although the acts described above may infringe several exclusive rights, different jurisdictions provide exceptions that ML developers could invoke.
European Union
In the EU, the relevant exceptions are those for temporary copies set out in Art. 5(1) InfoSoc Directive and the text and data mining (TDM) exception provided for in Arts. 3 and 4 DSM Directive.
Article 5(1) InfoSoc Directive states that:
“Temporary acts of reproduction referred to in Article 2, which are transient or incidental [and] an integral and essential part of a technological process and whose sole purpose is to enable:
(a) a transmission in a network between third parties by an intermediary, or
(b) a lawful use of a work or other subject-matter to be made,
and which have no independent economic significance, shall be exempted from the reproduction right provided for in Article 2”.
In turn, Art. 3 DSM Directive stipulates that:
“1. Member States shall provide for an exception to the rights provided for in Article 5(a) and Article 7(1) of Directive 96/9/EC, Article 2 of Directive 2001/29/EC, and Article 15(1) of this Directive for reproductions and extractions made by research organisations and cultural heritage institutions in order to carry out, for the purposes of scientific research, text and data mining of works or other subject matter to which they have lawful access.
2. Copies of works or other subject matter made in compliance with paragraph 1 shall be stored with an appropriate level of security and may be retained for the purposes of scientific research, including for the verification of research results.
3. Rightholders shall be allowed to apply measures to ensure the security and integrity of the networks and databases where the works or other subject matter are hosted. Such measures shall not go beyond what is necessary to achieve that objective.
4. Member States shall encourage rightholders, research organisations and cultural heritage institutions to define commonly agreed best practices concerning the application of the obligation and of the measures referred to in paragraphs 2 and 3 respectively”.
Lastly, Art. 4 DSM Directive establishes the following:
“1. Member States shall provide for an exception or limitation to the rights provided for in Article 5(a) and Article 7(1) of Directive 96/9/EC, Article 2 of Directive 2001/29/EC, Article 4(1)(a) and (b) of Directive 2009/24/EC and Article 15(1) of this Directive for reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining.
2. Reproductions and extractions made pursuant to paragraph 1 may be retained for as long as is necessary for the purposes of text and data mining.
3. The exception or limitation provided for in paragraph 1 shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online.
4. This Article shall not affect the application of Article 3 of this Directive”.
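Art. 4(3) refers to reservations expressed by “machine-readable means” without naming a specific technology. One signal that is frequently discussed in this context is the robots.txt file. As a rough sketch, and assuming for the sake of illustration that a robots.txt rule is treated as a reservation (which is my assumption, not something the Directive states), a crawler could check it with Python's standard `urllib.robotparser` module. The user agent `ExampleTDMBot`, the function name `crawl_permitted` and the sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in which the site blocks one TDM crawler
# but leaves its content open to every other user agent.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleTDMBot
Disallow: /

User-agent: *
Allow: /
"""


def crawl_permitted(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.org/articles/some-article.html"
    for agent in ("ExampleTDMBot", "OtherBot"):
        print(agent, crawl_permitted(SAMPLE_ROBOTS_TXT, agent, page))
```

A check of this kind only tells the crawler what the website operator has signalled. Whether that signal comes from the rightholder, and whether it counts as a reservation made “in an appropriate manner”, remain legal questions that the code cannot answer.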
United States
Whether the use of protected material for ML training is an infringement or a “fair use” will be determined according to the four factors set out in Section 107 of the Copyright Act, which are:
“(1) the purpose and character of the use, including whether such use is commercial or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work”.
Japan
Meanwhile, Art. 30-4 of the Japanese Copyright Act provides the following:
“It is permissible to exploit work, in any way and to the extent considered necessary, in any of the following cases or other cases where such exploitation is not for enjoying or causing another person to enjoy the ideas or emotions expressed in such work; provided, however that this does not apply if the exploitation would unreasonably prejudice the interests of the copyright owner in light of the natures and purposes of such work, as well as the circumstances of such exploitation:
(i) exploitation for using the work in experiments for the development or practical realization of technologies concerning the recording of sounds and visuals or other exploitations of such work;
(ii) exploitation for using the work in a data analysis (meaning the extraction, comparison, classification, or other statistical analysis of language, sound, or image data, or other elements of which a large number of works or a large volume of data is composed); and
(iii) in addition to the cases set forth in the preceding two items, exploitation for using the work in the course of computer data processing or otherwise that does not involve perceiving the expressions in such work through the human sense”.
Critics argue that the TDM exceptions set out in Arts. 3 and 4 DSM Directive are too rigid and increase transaction costs, thereby hindering the development of AI in the EU. Moreover, these exceptions are said to be less flexible than those adopted in the US and Japan. In the next two posts, I will evaluate whether the criticism of the EU's approach is justified and whether fair use is indeed more flexible when applied in this context. To gain a better understanding of the exception provided in the Japanese Copyright Act, I recommend the following reading:
Ueno, Tatsuhiro: ‘The Flexible Copyright Exception for “Non-Enjoyment” Purposes – Recent Amendment in Japan and Its Implication’ (2021) 70(2) GRUR International 145, 152.
[1] Art. 2(4) Regulation (EU) 2022/1925 of 14 September 2022 on contestable and fair markets in the digital sector (Digital Markets Act, DMA); Art. 2(1) Regulation (EU) 2022/868 of 30 May 2022 on European data governance and amending Regulation (EU) 2018/1724 (Data Governance Act, DGA); and Art. 2(1) Regulation (EU) 2023/2854 of 13 December 2023 on harmonised rules on fair access to and use of data (Data Act, DA).
[2] Art. 2 Berne Convention for the Protection of Literary and Artistic Works of September 9, 1886.
[3] Among others, C-145/10, Painer v. Standard Verlags GmbH and others (2011) ECLI:EU:C:2011:798, paras. 119, 120; C-604/10, Football Dataco Ltd v. Yahoo! UK Ltd and others (2012) ECLI:EU:C:2012:115, paras. 37, 39; C-403/08 and C-429/08, Football Association Premier League v. QC Leisure and Karen Murphy v. Media Protection Services (2011) ECLI:EU:C:2011:631, para. 97.
[4] Art. 2(a) Directive 2001/29/EC of 22 May 2001 on the harmonisation of certain aspects of copyright and related rights in the information society (InfoSoc Directive).
[5] Art. 8 WIPO Copyright Treaty (WCT) 1996; Art. 3(1) InfoSoc Directive.
[6] See Eleonora Rosati, ‘The right of adaptation has not been generally harmonised at the EU level: true or false?’ (The IPKat, May 01, 2014) <https://ipkitten.blogspot.com/2014/05/the-right-of-adaptation-has-not-been.html>.
[7] Arts. 5, 10 and 12 of The International Convention for the Protection of Performers, Producers of Phonograms and Broadcasting Organisations, 26 October 1961 (the Rome Convention); and Chapter III of the WIPO Performances and Phonograms Treaty (WPPT), 20 December 1996.
[8] Art. 3 Directive 2006/115/EC of 12 December 2006 on rental right and lending right and on certain rights related to copyright in the field of intellectual property; and Arts. 2 and 3 InfoSoc Directive.
[9] Arts. 4, 7 and 8 Rome Convention, and Chapter II WPPT.
[10] Art. 15 Directive (EU) 2019/790 of 17 April 2019 on copyright and related rights in the Digital Single Market (DSM Directive).
[11] Art. 6 Directive 2006/116/EC of 12 December 2006 on the term of protection of copyright and certain related rights (Term Directive).
[12] See <https://chatgptiseatingtheworld.com/category/map-of-ai-copyright-lawsuits/>; and <https://www.bakerlaw.com/services/artificial-intelligence-ai/case-tracker-artificial-intelligence-copyrights-and-class-actions/>.
